To my wife Karin


Preface 


This book is intended as an introductory course for students in mathematics, phys- 
ical sciences, engineering, or in other related fields. It is based on the experience 
of probability lectures taught during the past 25 years, where the spectrum reached 
from two-hour introductory courses, over Measure Theory and advanced probability 
classes, to such topics as Stochastic Processes and Mathematical Statistics. Until 2012 
these lectures were delivered to students at the University of Jena (Germany), and since 
2013 to those at the University of Delaware in Newark (USA). 

The book is the completely revised version of the German edition “Stochastik für
das Lehramt,” which was published by De Gruyter in 2014. At most universities in Germany,
there exist special classes in Probability Theory for students who want to become 
teachers of mathematics in high schools. Besides basic facts about Probability Theory, 
these courses are also supposed to give an introduction into Mathematical Statistics. 
Thus, the original main intention for the German version was to write a book that helps
those students understand Probability Theory better. But soon the book turned out to
also be useful as an introduction for students in other fields, e.g. in mathematics,
physics, and so on. Thus we decided, in order to make the book accessible to a broader
audience, to provide a translation into English.


During numerous years of teaching I learned the following: 

- Probabilistic questions are usually easy to formulate, generally have a tight relation
  to everyday problems, and therefore attract the interest of the audience. Every
  student knows the phenomena that occur when one rolls a die, plays cards, tosses
  a coin, or plays a lottery. Thus, an initial interest in Probability Theory exists.

- In contrast, after a short time many students have very serious difficulties with
  understanding the presented topics. Consequently, a common opinion among
  students is that Probability Theory is a very complicated topic, causing a lot of
  problems and troubles.


Surely there exist several reasons for the bad image of Probability Theory among stu- 
dents. But, as we believe, the most important one is as follows. In Probability Theory, 
the type of problems and questions considered, as well as the way of thinking, differs 
considerably from the problems, questions, and thinking in other fields of mathem- 
atics, i.e., from fields with which the students became acquainted before attending a 
probability course. For example, in Calculus a function has a well-described domain 
of definition; mostly it is defined by a concrete formula, has certain properties as con- 
tinuity, differentiability, and so on. A function is something very concrete which can 
be made vivid by drawing its graph. In contrast, in Probability Theory functions are 
mostly investigated as random variables. They are defined on a completely unimport- 
ant, nonspecified sample space, and they generally do not possess a concrete formula 
for their definition. It may even happen that only the existence of a function (random 
variable) is known. The only property of a random variable which really matters is 
the distribution of its values. This and many other similar techniques make the whole
theory something mysterious and not completely comprehensible. 

Considering this observation, we organized the book in a way that tries to make 
probabilistic problems more understandable and that puts the focus more onto ex- 
planations of the definitions, notations, and results. The tools we use to do this are 
examples; we present at least one before a new definition, in order to motivate it, fol- 
lowed by more examples after the definition to make it comprehensible. Here we act 
upon the maxim expressed by Einstein’s quote¹:


Example isn’t another way to teach, it is the only way to teach. 


Presenting the basic results and methods in Probability Theory without using results,
facts, and notations from Measure Theory is, in our opinion, as difficult as squaring
the circle. Either one restricts oneself to discrete probability measures and random
variables or one has to be imprecise. There is no other choice! In some places, it is
possible to avoid the use of measure theoretic facts, such as the Lebesgue integral, or
the existence of infinite product measures, and so on, but the price is high.² Of course,
I also struggled with the problem of missing facts from Measure Theory while writing
this book. Therefore, I tried to include some ideas and some results about σ-fields,
measures, and integrals, hoping that a few readers become interested and want to
learn more about Measure Theory. For those, we refer to the books [Coh13], [Dud02],
or [Bil12] as good sources.

In this context, let us make a remark about the verification of the presented
results. Whenever it was possible, we tried to prove the stated results. Times have
changed; when I was a student, every theorem presented in a mathematical lecture 
was proved — really every one. Facts and results without proof were doubtful and 
soon forgotten. And a tricky and elegant proof is sometimes more impressive than 
the proven result (at least to us). Hopefully, some readers will like some of the proofs 
in this book as much as we did. 

One of the most used applications of Probability Theory is Mathematical Statistics.
When I met former students of mine, I often asked them which kind of mathematics
they mainly use now in their daily work. The overwhelming majority of them
answered that one of their main fields of mathematical work is statistical problems.
Therefore, we decided to include an introductory chapter about Mathematical Statistics.
Nowadays, due to the existence of good and fast statistical programs, it is very
easy to analyze data, to evaluate confidence regions, or to test a given hypothesis.
But do those who use these programs also always know what they are doing? Since 


1 See http://www.alberteinsteinsite.com/quotes/einsteinquotes.html 

2 For example, several years ago, to avoid the use of the Lebesgue integral, I introduced the expected 
value of a random variable as a Riemann integral via its distribution function. This is mathematically 
correct, but at the end almost no students understood what the expected value really is. Try to prove 
that the expected value is linear using this approach! 
we doubt that this is so, we focus in this chapter on the question of why
the main statistical methods work and on what mathematical background they rest.
We also investigate how precise statistical decisions are and what kinds of errors may
occur.

The organization of this book differs a little bit from that of many other first-course
books about Probability Theory. Having Measure Theory in the back of our
minds causes us to think that probability measures are the most important ingredient
of Probability Theory; random variables come in second. In contrast, many
other authors go exactly the other way. They start with random variables, and probability
measures then occur as their distributions on their range spaces (mostly $\mathbb{R}$).
In this case, a standard normal probability measure does not exist, only a standard
normally distributed random variable. Both approaches have their advantages and
disadvantages, but as we said, for us the probability measures are interesting in their own
right, and therefore we start with them in Chapter 1, followed by random variables in
Section 3.

The book also contains some facts and results that are more advanced and usually
not part of an introductory course in Probability Theory. Such topics are, for example,
the investigation of product measures, order statistics, and so on. We have marked
those more involved sections with a star. They may be skipped at a first reading
without loss of continuity in the following chapters.

At the end of each chapter, one finds a collection of problems related to the
contents of that chapter. Here we restricted ourselves to a few problems whose
solutions are helpful for understanding the presented
topics. The problems are mainly taken from our collection of homework and exams
from past years. For those who want to work on more problems we refer to
the many books, e.g. [GS01a], [Gha05], [Pao06], or [Ros14], which contain a huge collection
of probabilistic problems, ranging from easy to difficult, from natural to artificial,
from interesting to boring.

Finally I want to express my thanks to those who supported my work at the trans- 
lation and revision of the present book. Many students at the University of Delaware 
helped me to improve my English and to correct wrong phrases and wrong expres- 
sions. To mention all of them is impossible. But among them were a few students 
who read whole chapters and, without them, the book would have never been fin- 
ished (or readable). In particular I want to mention Emily Wagner and Spencer Walker. 
They both did really a great job. Many thanks! Let me also express my gratitude to 
Colleen McInerney, Rachel Austin, Daniel Atadan, and Quentin Dubroff, all students 
in Delaware and attending my classes for some time. They also read whole sections of 
the book and corrected my broken English. Finally, my thanks go to Professor Anne
Leucht from the Technical University in Braunschweig (Germany); her field of work is
Mathematical Statistics, and her hints and remarks about Chapter 8 in this book were
important to me.
And last but not least I want to thank the Department of Mathematical Sciences
at the University of Delaware for the excellent working conditions after my retirement 
in Germany. 


Newark, Delaware, June 6, 2016 Werner Linde 
1 Probabilities


1.1 Probability Spaces 


The basic concern of Probability Theory is to model experiments involving randomness,
that is, experiments whose outcome is not determined in advance; for short, these are
called random experiments. The Russian mathematician A.N. Kolmogorov established modern
Probability Theory in 1933 by publishing his book (cf. [Kol33]) Grundbegriffe der
Wahrscheinlichkeitsrechnung. In it, he postulated the following:


Random experiments are described by probability spaces $(\Omega, \mathcal{A}, P)$


The triple $(\Omega, \mathcal{A}, P)$ comprises a sample space $\Omega$, a σ-field $\mathcal{A}$ of events, and a
mapping $P$ from $\mathcal{A}$ to $[0, 1]$, called probability measure or probability distribution.

Let us now explain the three different components of a probability space in detail. 
We start with the sample space. 


1.1.1 Sample Spaces 


Definition 1.1.1. The sample space $\Omega$ is a nonempty set that contains (at least) all
possible outcomes of the random experiment.


Remark 1.1.2. For mathematical reasons it can sometimes be useful to choose $\Omega$
larger than necessary. It is only important that the sample space contains all possible
results.


Example 1.1.3. When rolling a die one time the natural choice for the sample space is
$\Omega = \{1, \ldots, 6\}$. However, it would also be possible to take $\Omega = \{1, 2, \ldots\}$ or even $\Omega = \mathbb{R}$.
In contrast, $\Omega = \{1, \ldots, 5\}$ is not suitable for the description of the experiment.


Example 1.1.4. Roll a die until the number “6” shows up for the first time. Record the
number of necessary rolls until the first appearance of “6.” The suitable sample space
in this case is $\Omega = \{1, 2, \ldots\}$. Any finite set $\{1, 2, \ldots, N\}$ is not appropriate because,
even if we choose $N$ very large, we can never be 100% sure that the first “6” really
appears during the first $N$ rolls.


Example 1.1.5. A light bulb is switched on at time zero and burns for a certain period
of time. At some random time $t > 0$ it burns out. To describe this experiment we have
to take into account all possible times $t > 0$. Therefore, a natural choice for the sample
space in this case is $\Omega = (0, \infty)$, or, if we do not exclude that the bulb is defective from
the very beginning, then $\Omega = [0, \infty)$.
Subsets of the sample space $\Omega$ are called events. In other words, the powerset $\mathcal{P}(\Omega)$
is the collection of all possible events. For example, when we roll a die once there are
$2^6 = 64$ possible events, as, for example,

\[ \emptyset, \{1\}, \ldots, \{6\}, \{1,2\}, \ldots, \{1,2,3,4,5\}, \ldots, \Omega . \]

Among all events there are some of special interest, the so-called elementary events.
These are events containing exactly one element. In Example 1.1.3 the elementary
events are

\[ \{1\}, \{2\}, \{3\}, \{4\}, \{5\} \text{ and } \{6\}. \]
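
The count of 64 events for one roll of a die can be checked by listing the powerset explicitly. The following is only a small sketch; the helper name `powerset` is our own choice and not notation from the text.

```python
from itertools import chain, combinations


def powerset(omega):
    """Return all subsets of the finite set `omega` as frozensets."""
    items = list(omega)
    subsets = chain.from_iterable(
        combinations(items, size) for size in range(len(items) + 1)
    )
    return [frozenset(subset) for subset in subsets]


omega = {1, 2, 3, 4, 5, 6}           # sample space for one roll of a die
events = powerset(omega)             # every subset of omega is an event
elementary = [event for event in events if len(event) == 1]

print(len(events))       # 64, i.e. 2**6
print(len(elementary))   # 6 elementary events
```

Both the empty set and the whole sample space appear in `events`, in line with the list of examples above.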


Remark 1.1.6. Never confuse the elementary events with the points that they contain.
Look at Example 1.1.3. There we have $6 \in \Omega$, while the generated elementary event
satisfies $\{6\} \in \mathcal{P}(\Omega)$.


Let $A \subseteq \Omega$ be an event. After executing the random experiment one observes a result
$\omega \in \Omega$. Then two cases are possible.

1. The outcome $\omega$ belongs to $A$. In this case we say that the event $A$ occurred.

2. If $\omega$ is not in $A$, that is, if $\omega \in A^c$, then the event $A$ did not occur.


Example 1.1.7. Roll a die once and let A = {2,4}. Say the outcome was number “6.” 
Then A did not occur. But, if we obtained number “2,” then A occurred. 


Example 1.1.8. In Example 1.1.5 the occurrence of an event $A = [T, \infty)$ tells us that the
light bulb burned out after time $T$ or, in other words, at time $T$ it still shone.


Let us formulate some easy rules for the occurrence of events. 

1. By the choice of the sample space the event $\Omega$ always occurs. Therefore, $\Omega$ is also
   called the certain event.

2. The empty set $\emptyset$ never occurs. Thus it is called the impossible event.

3. An event $A$ occurs if and only if the complementary event $A^c$ does not, and vice
   versa, $A$ does not occur if and only if $A^c$ does.

4. If $A$ and $B$ are two events, then $A \cup B$ occurs if and only if at least one of the two
   events occurs. Hereby we do not exclude that $A$ and $B$ may both occur.

5. The event $A \cap B$ occurs if and only if $A$ and $B$ both occur.


1.1.2 σ-Fields of Events


The basic aim of Probability Theory is to assign to each event $A$ a number $P(A)$ in
$[0, 1]$, which describes the likelihood of its occurrence. If the occurrence of an event $A$
is very likely, then $P(A)$ should be close to 1, while $P(A)$ close to zero suggests that the
appearance of $A$ is very unlikely. The mapping $A \mapsto P(A)$ must possess certain natural
properties. Unfortunately, for mathematical reasons it is not always possible to assign
to each event $A$ a number $P(A)$ such that $A \mapsto P(A)$ has the desired properties. The
solution is ingenious and one of the key observations in Kolmogorov’s approach:
one chooses a subset $\mathcal{A} \subseteq \mathcal{P}(\Omega)$ such that $P(A)$ is only defined for $A \in \mathcal{A}$. If $A \notin \mathcal{A}$, then
$P(A)$ does not exist. Of course, $\mathcal{A}$ should be chosen as large as possible and, moreover,
at least “ordinary” sets should belong to $\mathcal{A}$.

The collection $\mathcal{A}$ of events has to satisfy some algebraic conditions. More precisely,
the following properties are supposed.


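Definition 1.1.9. A collection $\mathcal{A}$ of subsets of $\Omega$ is called σ-field if
(1) $\emptyset \in \mathcal{A}$,

(2) if $A \in \mathcal{A}$ then $A^c \in \mathcal{A}$, and

(3) for countably many $A_1, A_2, \ldots$ in $\mathcal{A}$ follows $\bigcup_{j=1}^{\infty} A_j \in \mathcal{A}$.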


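Let us verify some easy properties of σ-fields.


Proposition 1.1.10. Let $\mathcal{A}$ be a σ-field of subsets of $\Omega$. Then the following are valid:

(i) $\Omega \in \mathcal{A}$.
(ii) If $A_1, \ldots, A_n$ are finitely many sets in $\mathcal{A}$, then $\bigcup_{j=1}^{n} A_j \in \mathcal{A}$.
(iii) If $A_1, A_2, \ldots$ belong to $\mathcal{A}$, then so does $\bigcap_{j=1}^{\infty} A_j$.
(iv) Whenever $A_1, \ldots, A_n \in \mathcal{A}$, then $\bigcap_{j=1}^{n} A_j \in \mathcal{A}$.


Proof: Assertion (i) is a direct consequence of $\emptyset \in \mathcal{A}$ combined with property (2) of
σ-fields, because $\Omega = \emptyset^c$.

To verify (ii) let $A_1, \ldots, A_n$ be in $\mathcal{A}$. Set $A_{n+1} = A_{n+2} = \cdots = \emptyset$. Then for all $j = 1, 2, \ldots$
we have $A_j \in \mathcal{A}$ and by property (3) of σ-fields also $\bigcup_{j=1}^{\infty} A_j \in \mathcal{A}$. But note that
$\bigcup_{j=1}^{\infty} A_j = \bigcup_{j=1}^{n} A_j$, hence (ii) is valid.

To prove (iii) we first observe that $A_j \in \mathcal{A}$ yields $A_j^c \in \mathcal{A}$, hence $\bigcup_{j=1}^{\infty} A_j^c \in \mathcal{A}$. Another
application of (2) implies $\big(\bigcup_{j=1}^{\infty} A_j^c\big)^c \in \mathcal{A}$. De Morgan’s rule asserts

\[ \Big(\bigcup_{j=1}^{\infty} A_j^c\Big)^c = \bigcap_{j=1}^{\infty} A_j , \]

which completes the proof of (iii).

Assertion (iv) may be derived from an application of (ii) to the complementary sets
as we did in the proof of (iii). Or use the method in the proof of (ii), but this time we
choose $A_{n+1} = A_{n+2} = \cdots = \Omega$ and apply (iii). □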


Corollary 1.1.11. If sets $A$ and $B$ belong to a σ-field $\mathcal{A}$, then so do $A \cup B$, $A \cap B$, $A \setminus B$, and
$A \Delta B$.

The easiest examples of σ-fields are either $\mathcal{A} = \{\emptyset, \Omega\}$ or $\mathcal{A} = \mathcal{P}(\Omega)$. However, the
former σ-field is much too small for applications while the latter one is generally too
big, at least if the sample space is uncountably infinite. We will shortly indicate how
one constructs suitable σ-fields in the case of “large” sample spaces as, for example,
$\mathbb{R}$ or $\mathbb{R}^n$.


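Proposition 1.1.12. Let $\mathcal{C}$ be an arbitrary nonempty collection of subsets of $\Omega$. Then
there is a σ-field $\mathcal{A}$ possessing the following properties:

1. It holds $\mathcal{C} \subseteq \mathcal{A}$ or, verbally, each set $C \in \mathcal{C}$ belongs to the σ-field $\mathcal{A}$.

2. The σ-field $\mathcal{A}$ is the smallest one possessing this property. That is, whenever $\mathcal{A}'$ is
   another σ-field with $\mathcal{C} \subseteq \mathcal{A}'$, then $\mathcal{A} \subseteq \mathcal{A}'$.


Proof: Let $\Phi$ be the collection of all σ-fields $\mathcal{A}'$ on $\Omega$ for which $\mathcal{C} \subseteq \mathcal{A}'$, that is,

\[ \Phi := \{ \mathcal{A}' \subseteq \mathcal{P}(\Omega) : \mathcal{C} \subseteq \mathcal{A}',\ \mathcal{A}' \text{ is a } \sigma\text{-field} \}. \]

The collection $\Phi$ is nonempty because it contains at least one element, namely the
powerset of $\Omega$. Of course, $\mathcal{P}(\Omega)$ is a σ-field and $\mathcal{C} \subseteq \mathcal{P}(\Omega)$ for trivial reasons, hence
$\mathcal{P}(\Omega) \in \Phi$.

Next define $\mathcal{A}$ by

\[ \mathcal{A} := \bigcap_{\mathcal{A}' \in \Phi} \mathcal{A}' = \{ A \subseteq \Omega : A \in \mathcal{A}' \ \ \forall \mathcal{A}' \in \Phi \}. \]

It is not difficult to prove that $\mathcal{A}$ is a σ-field with $\mathcal{C} \subseteq \mathcal{A}$. Indeed, if $C \in \mathcal{C}$, then $C \in \mathcal{A}'$
for all $\mathcal{A}' \in \Phi$, hence by construction of $\mathcal{A}$ we get $C \in \mathcal{A}$.

Furthermore, $\mathcal{A}$ is also the smallest σ-field containing $\mathcal{C}$. To see this, take an arbitrary
σ-field $\tilde{\mathcal{A}}$ containing $\mathcal{C}$. Then $\tilde{\mathcal{A}} \in \Phi$, which implies $\mathcal{A} \subseteq \tilde{\mathcal{A}}$ because $\mathcal{A}$ is the
intersection over all σ-fields in $\Phi$. This completes the proof. □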


Definition 1.1.13. Let $\mathcal{C}$ be an arbitrary nonempty collection of subsets of $\Omega$. The
smallest σ-field containing $\mathcal{C}$ is called the σ-field generated by $\mathcal{C}$. It is denoted
by $\sigma(\mathcal{C})$.
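
For a finite sample space the generated σ-field can be computed directly, because countable unions reduce to finite ones. The following sketch assumes a finite $\Omega$ and simply closes the generating collection under complements and pairwise unions until nothing new appears; the function name `generated_sigma_field` is hypothetical and the procedure is an illustration, not the construction used in the proof above.

```python
def generated_sigma_field(omega, collection):
    """Smallest family of subsets of a FINITE `omega` that contains
    `collection` and is closed under complements and unions."""
    omega = frozenset(omega)
    field = {frozenset(), omega} | {frozenset(c) for c in collection}
    changed = True
    while changed:                       # repeat until the family is stable
        changed = False
        for a in list(field):
            for b in list(field):
                for candidate in (omega - a, a | b):
                    if candidate not in field:
                        field.add(candidate)
                        changed = True
    return field


omega = {1, 2, 3, 4, 5, 6}
sigma = generated_sigma_field(omega, [{1, 2}, {2, 3}])
print(len(sigma))   # 16
```

In this example the generators split $\Omega$ into the four blocks $\{1\}$, $\{2\}$, $\{3\}$, $\{4,5,6\}$, and the generated σ-field consists of all $2^4 = 16$ unions of these blocks. In particular $\{4\}$ alone is not an element of it.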


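Remark 1.1.14. $\sigma(\mathcal{C})$ is characterized by the three following properties:
1. $\sigma(\mathcal{C})$ is a σ-field.

2. $\mathcal{C} \subseteq \sigma(\mathcal{C})$.

3. If $\mathcal{C} \subseteq \mathcal{A}'$ for some σ-field $\mathcal{A}'$, then $\sigma(\mathcal{C}) \subseteq \mathcal{A}'$.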


Definition 1.1.15. Let $\mathcal{C} \subseteq \mathcal{P}(\mathbb{R})$ be the collection of all finite closed intervals in $\mathbb{R}$,
that is,

\[ \mathcal{C} = \{ [a, b] : a < b,\ a, b \in \mathbb{R} \}. \]

The σ-field generated by $\mathcal{C}$ is denoted by $\mathcal{B}(\mathbb{R})$ and is called Borel σ-field. If
$B \in \mathcal{B}(\mathbb{R})$, then it is said to be a Borel set.


Remark 1.1.16. By construction every closed interval in $\mathbb{R}$ is a Borel set. Furthermore,
the properties of σ-fields also imply that complements of such intervals, their countable
unions, and intersections are Borel sets. One might believe that all subsets of $\mathbb{R}$
are Borel sets. This is not the case; for the construction of a non-Borel set we refer to
[Gha05], Example 1.21, or [Dud02], pages 105-108.


Remark 1.1.17. There exist many other systems of subsets in $\mathbb{R}$ generating $\mathcal{B}(\mathbb{R})$. Let
us only mention two of them:

\[ \mathcal{C}_1 = \{ (-\infty, b] : b \in \mathbb{R} \} \quad\text{or}\quad \mathcal{C}_2 = \{ (a, \infty) : a \in \mathbb{R} \}. \]


1.1.3 Probability Measures


The occurrence of an event in a random experiment is not completely haphazard.
Although we are not able to predict the outcome of the next trial, the occurrence or
nonoccurrence of an event follows certain rules. Some events are more likely to occur,
others less. The degree of likelihood of an event $A$ is described by a number $P(A)$,
called the probability of the occurrence of $A$ (in short, probability of $A$). The most common
scale for probabilities is $0 \le P(A) \le 1$, where the larger $P(A)$ is, the more likely is
the occurrence of $A$. One could also think of other scales such as $0 \le P(A) \le 100$. In fact, this is
even quite often used; in this sense a chance of 50% equals a probability of 1/2.

What does it mean that an event A has probability P(A)? For example, what does 
it tell us that an event occurs with probability 1/2? Does this mean a half-occurrence 
of A? Surely not. 

To answer this question we have to assume that we execute an experiment not
only once¹ but several, say $n$, times. Thereby we have to ensure that the conditions


1 It does not make sense to speak of the probability of an event that can be executed only once. For 
example, it is (mathematically) absurd to ask for the probability that the Eiffel Tower will be in Paris 
for yet another 100 years. 
of the experiment do not change and that the single results do not depend on each
other. Let

\[ a_n(A) := \text{number of trials where } A \text{ occurs}. \]

The quantity $a_n(A)$ is called absolute frequency of the occurrence of $A$ in $n$ trials.
Observe that $a_n(A)$ is a random number with $0 \le a_n(A) \le n$. Next we set

\[ r_n(A) := \frac{a_n(A)}{n} \tag{1.1} \]

and name it relative frequency of the occurrence of $A$ in $n$ trials. This number is
random as well, but now $0 \le r_n(A) \le 1$.

It is somehow intuitively clear² that these relative frequencies converge to a
(nonrandom) number as $n \to \infty$. And this limit is exactly the desired probability of
the occurrence of the event $A$. Let us express this in a different way: say we execute
an experiment $n$ times for some large $n$. Then, on average, we will observe the
occurrence of $A$ about $n \cdot P(A)$ times. For example, when rolling a fair die many times,
an even number will show up in approximately half of the cases.
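
The stabilization of relative frequencies can be observed in a small simulation. The sketch below assumes that Python's pseudorandom generator is an adequate stand-in for a fair die; the function name `relative_frequency` and the seed are our own choices.

```python
import random


def relative_frequency(event, n, rng):
    """Roll a simulated fair die n times and return r_n(event) = a_n(event) / n."""
    hits = sum(1 for _ in range(n) if rng.randint(1, 6) in event)
    return hits / n


rng = random.Random(2016)        # fixed seed only to make runs repeatable
even = {2, 4, 6}                 # the event "an even number shows up"
for n in (10, 100, 10_000, 100_000):
    print(n, relative_frequency(even, n, rng))
```

For small $n$ the printed values fluctuate noticeably, while for large $n$ they should lie close to 1/2. A simulation of this kind illustrates the statement but of course does not prove it.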

Which natural properties of $A \mapsto P(A)$ may be deduced from $\lim_{n \to \infty} r_n(A) = P(A)$?

1. Since $0 \le r_n(A) \le 1$, we conclude $0 \le P(A) \le 1$.

2. Because of $r_n(\Omega) = 1$ for each $n \ge 1$ we get $P(\Omega) = 1$.

3. The property $r_n(\emptyset) = 0$ yields $P(\emptyset) = 0$.

4. Let $A$ and $B$ be two disjoint events. Then $r_n(A \cup B) = r_n(A) + r_n(B)$, hence the limits
   should satisfy a similar relation, that is,

\[ P(A \cup B) = P(A) + P(B). \tag{1.2} \]


Definition 1.1.18. A mapping P fulfilling eq. (1.2) for disjoint A and B is called 
finitely additive. 


Remark 1.1.19. Applying eq. (1.2) successively leads to the following. If $A_1, \ldots, A_n$ are
disjoint, then

\[ P\Big(\bigcup_{j=1}^{n} A_j\Big) = \sum_{j=1}^{n} P(A_j). \]

Finite additivity is a very useful property of probabilities, and in the case of finite 
sample spaces, it completely suffices to build a fruitful theory. But as soon as the 


sample space is infinite it is too weak. To see this let us come back to Example 1.1.4. 


2 We will discuss this question more precisely in Section 7.1.
Assume we want to evaluate the probability of the event $A = \{2, 4, 6, \ldots\}$, that is,
the first “6” appears at an even number of trials. Then we have to split $A$ into (infinitely)
many disjoint events $\{2\}, \{4\}, \ldots$. The finite additivity of $P$ does not suffice to get
$P(A) = P(\{2\}) + P(\{4\}) + \cdots$. In order to evaluate $P(A)$ in this way we need the following
stronger property of $P$.


Definition 1.1.20. A mapping $P$ is said to be σ-additive provided that for countably
many disjoint $A_1, A_2, \ldots$ in $\Omega$ we get

\[ P\Big(\bigcup_{j=1}^{\infty} A_j\Big) = \sum_{j=1}^{\infty} P(A_j). \]
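
To see what σ-additivity delivers in Example 1.1.4, one may sum the probabilities of $\{2\}, \{4\}, \ldots$ numerically. The sketch below rests on an assumption that is not derived in this section: for a fair die with independent rolls, the first “6” appears at roll $k$ with probability $(5/6)^{k-1} \cdot (1/6)$. The function names are hypothetical.

```python
from fractions import Fraction


def p_first_six_at(k):
    """ASSUMED model: probability that the first "6" appears exactly at roll k."""
    return Fraction(5, 6) ** (k - 1) * Fraction(1, 6)


def partial_sum_even(m):
    """P({2}) + P({4}) + ... + P({2m}), a finite piece of the infinite sum."""
    return sum(p_first_six_at(2 * i) for i in range(1, m + 1))


for m in (1, 5, 20, 50):
    print(m, float(partial_sum_even(m)))
print("limit of the geometric series:", float(Fraction(5, 11)))
```

Under this assumed model the partial sums increase to $5/11$, the value of the geometric series $\sum_{i \ge 1} (5/6)^{2i-1} \cdot (1/6)$. Finite additivity alone only justifies each partial sum; identifying the limit with $P(A)$ is exactly what σ-additivity provides.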


Let us summarize what we have until now: a mapping P assigning each event its 
probability should possess the following natural properties: 

1. For all $A$ holds $0 \le P(A) \le 1$.

2. We have $P(\emptyset) = 0$ and $P(\Omega) = 1$.

3. The mapping $P$ has to be σ-additive.


Thus, given a sample space $\Omega$, we look for a function $P$ defined on $\mathcal{P}(\Omega)$ satisfying
the previous properties. But, as already mentioned, if $\Omega$ is uncountable, for
example, $\Omega = \mathbb{R}$, then only very special³ $P$ with these properties exist.

To overcome these difficulties, in such cases we have to restrict $P$ to a σ-field
$\mathcal{A} \subseteq \mathcal{P}(\Omega)$.


Definition 1.1.21. Let $\Omega$ be a sample space and let $\mathcal{A}$ be a σ-field of subsets
of $\Omega$. A function $P : \mathcal{A} \to [0, 1]$ is called probability measure or probability
distribution on $(\Omega, \mathcal{A})$ if

1. $P(\emptyset) = 0$ and $P(\Omega) = 1$.

2. $P$ is σ-additive, that is, for each sequence of disjoint sets $A_j \in \mathcal{A}$, $j = 1, 2, \ldots$,
   follows

\[ P\Big(\bigcup_{j=1}^{\infty} A_j\Big) = \sum_{j=1}^{\infty} P(A_j). \tag{1.3} \]


Remark 1.1.22. Note that the left-hand side of eq. (1.3) is well-defined. Indeed, since
$\mathcal{A}$ is a σ-field, $A_j \in \mathcal{A}$ implies $\bigcup_{j=1}^{\infty} A_j \in \mathcal{A}$ as well.


Now we are in a position to define probability spaces in the exact way. 


3 Discrete ones as we will investigate in Section 1.3. 
Definition 1.1.23. A probability space is a triple $(\Omega, \mathcal{A}, P)$, where $\Omega$ is a sample
space, $\mathcal{A}$ denotes a σ-field consisting of subsets of $\Omega$, and $P : \mathcal{A} \to [0, 1]$ is a
probability measure.


Remark 1.1.24. Given $A \in \mathcal{A}$, the number $P(A)$ describes its probability or, more precisely,
its probability of occurrence. Subsets $A$ of $\Omega$ with $A \notin \mathcal{A}$ do not possess a
probability.


Let us demonstrate a simple example on how to construct a probability space for a 
given random experiment. Several other examples will follow soon. 


Example 1.1.25. We ask for a probability space that describes rolling a fair die one
time. Of course, $\Omega = \{1, \ldots, 6\}$ and $\mathcal{A} = \mathcal{P}(\Omega)$. The mapping $P : \mathcal{P}(\Omega) \to [0, 1]$ is
given by

\[ P(A) := \frac{\#(A)}{6}, \qquad A \subseteq \{1, \ldots, 6\}. \]

Recall that $\#(A)$ denotes the cardinality of the set $A$.
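
The probability space of the fair die is small enough to be written down in a few lines. The following sketch uses exact fractions; the names `OMEGA` and `prob` are hypothetical and only serve the illustration.

```python
from fractions import Fraction

OMEGA = frozenset(range(1, 7))   # sample space {1, ..., 6}


def prob(event):
    """P(A) = #(A) / 6 for a subset A of OMEGA."""
    event = frozenset(event)
    if not event <= OMEGA:
        raise ValueError("event must be a subset of the sample space")
    return Fraction(len(event), len(OMEGA))


print(prob({2, 4, 6}))           # 1/2
print(prob(set()), prob(OMEGA))  # 0 1
```

For disjoint events such as $\{1, 2\}$ and $\{5\}$ one can check directly that `prob({1, 2}) + prob({5})` equals `prob({1, 2, 5})`, as additivity demands.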


Remark 1.1.26. Suppose we want to find a model for some concrete random experiment.
How do we do this? In most cases the sample space is immediately determined
by the results we expect. Once the question about $\Omega$ is settled, the choice of the σ-field
depends on the size of the sample space. If $\Omega$ is finite or countably infinite, then we may
choose $\mathcal{A} = \mathcal{P}(\Omega)$. If $\Omega = \mathbb{R}$ or even $\mathbb{R}^n$, we take the corresponding Borel σ-fields. The
challenging task is the determination of the probability measure $P$. Here the following
approaches are possible.

1. Theoretical considerations lead quite often to the determination of $P$. For example,
   since the faces of a fair die are all equally likely, this already describes $P$
   completely. Similar arguments can be used for certain games or also for lotteries.

2. If theoretical considerations are neither possible nor available then statistical
   investigations may help. This approach is based on the fact that the relative
   frequencies $r_n(A)$ converge to $P(A)$. Thus, one executes $n$ trials of the experiment and
   records the relative frequency of the occurrence of $A$. For example, one may question
   $n$ randomly chosen persons or one does $n$ independent measurements of the
   same item. Then $r_n(A)$ may be used to approximate the value of $P(A)$.

3. Sometimes also subjective or experience-based approaches can be used to find
   approximate probabilities. These may be erroneous, but maybe they give some
   hint for the correct distribution. For example, if a new product is on the market,
   the distribution of its lifetime is not yet known. At the beginning one uses data of
   an already existing similar product. Once data about the new product
   become available, the probabilities can be determined more accurately.
1.2 Basic Properties of Probability Measures


Probability measures obey many useful properties. Let us summarize the most important
ones in the next proposition.


Proposition 1.2.1. Let $(\Omega, \mathcal{A}, P)$ be a probability space. Then the following are valid.

(1) $P$ is also finitely additive.

(2) If $A, B \in \mathcal{A}$ satisfy $A \subseteq B$, then $P(B \setminus A) = P(B) - P(A)$.

(3) We have $P(A^c) = 1 - P(A)$ for $A \in \mathcal{A}$.

(4) Probability measures are monotone, that is, if $A \subseteq B$ for some $A, B \in \mathcal{A}$, then
    $P(A) \le P(B)$.

(5) Probability measures are subadditive, that is, for all (not necessarily disjoint)
    events $A_j \in \mathcal{A}$ follows⁴

\[ P\Big(\bigcup_{j=1}^{\infty} A_j\Big) \le \sum_{j=1}^{\infty} P(A_j). \tag{1.4} \]

(6) Probability measures are continuous from below, that is, whenever $A_j \in \mathcal{A}$ satisfy
    $A_1 \subseteq A_2 \subseteq \cdots$, then

\[ P\Big(\bigcup_{j=1}^{\infty} A_j\Big) = \lim_{j \to \infty} P(A_j). \]

(7) In a similar way each probability measure is continuous from above: if $A_j \in \mathcal{A}$
    satisfy $A_1 \supseteq A_2 \supseteq \cdots$, then

\[ P\Big(\bigcap_{j=1}^{\infty} A_j\Big) = \lim_{j \to \infty} P(A_j). \]


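Proof: To prove (1) choose disjoint $A_1, \ldots, A_n$ in $\mathcal{A}$ and set $A_{n+1} = A_{n+2} = \cdots = \emptyset$. Then
$A_1, A_2, \ldots$ are infinitely many disjoint events in $\mathcal{A}$, hence the σ-additivity of $P$ implies

\[ P\Big(\bigcup_{j=1}^{\infty} A_j\Big) = \sum_{j=1}^{\infty} P(A_j). \]

Observe that $\bigcup_{j=1}^{\infty} A_j = \bigcup_{j=1}^{n} A_j$ and $P(A_j) = 0$ if $j > n$, so the previous equation
reduces to

\[ P\Big(\bigcup_{j=1}^{n} A_j\Big) = \sum_{j=1}^{n} P(A_j), \]

and $P$ is finitely additive.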


4 Estimate (1.4) is also known as Boole’s inequality. 
To prove (2) write $B = A \cup (B \setminus A)$ and observe that this is a disjoint decomposition of $B$.
Hence, by the finite additivity of $P$ we obtain

\[ P(B) = P(A) + P(B \setminus A). \]

Relocating $P(A)$ to the left-hand side proves (2).

An application of (2) to $\Omega$ and $A$ leads to

\[ P(A^c) = P(\Omega \setminus A) = P(\Omega) - P(A) = 1 - P(A), \]

which proves (3).

The monotonicity is an easy consequence of (2). Indeed,

\[ P(B) - P(A) = P(B \setminus A) \ge 0 \]

implies $P(B) \ge P(A)$.
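To prove inequality (1.4) choose arbitrary $A_1, A_2, \ldots$ in $\mathcal{A}$. Set $B_1 := A_1$ and, if $j \ge 2$,
then

\[ B_j := A_j \setminus (A_1 \cup \cdots \cup A_{j-1}). \]

Then $B_1, B_2, \ldots$ are disjoint subsets in $\mathcal{A}$ with $\bigcup_{j=1}^{\infty} B_j = \bigcup_{j=1}^{\infty} A_j$. Furthermore, by the
construction holds $B_j \subseteq A_j$, hence $P(B_j) \le P(A_j)$. An application of all these properties
yields

\[ P\Big(\bigcup_{j=1}^{\infty} A_j\Big) = P\Big(\bigcup_{j=1}^{\infty} B_j\Big) = \sum_{j=1}^{\infty} P(B_j) \le \sum_{j=1}^{\infty} P(A_j). \]

Thus (5) is proved.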
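Let us turn now to the continuity from below. Choose $A_1, A_2, \ldots$ in $\mathcal{A}$ satisfying
$A_1 \subseteq A_2 \subseteq \cdots$. With $A_0 := \emptyset$ set

\[ B_k := A_k \setminus A_{k-1}, \qquad k = 1, 2, \ldots \]

The $B_k$s are disjoint and, moreover, $\bigcup_{k=1}^{\infty} B_k = \bigcup_{j=1}^{\infty} A_j$. Furthermore, because of
$A_{k-1} \subseteq A_k$ from (2) we get $P(B_k) = P(A_k) - P(A_{k-1})$. When putting this all together, it
follows

\[
\begin{aligned}
P\Big(\bigcup_{j=1}^{\infty} A_j\Big) &= P\Big(\bigcup_{k=1}^{\infty} B_k\Big) = \sum_{k=1}^{\infty} P(B_k) \\
&= \lim_{j \to \infty} \sum_{k=1}^{j} P(B_k) = \lim_{j \to \infty} \sum_{k=1}^{j} \big[ P(A_k) - P(A_{k-1}) \big] \\
&= \lim_{j \to \infty} \big[ P(A_j) - P(A_0) \big] = \lim_{j \to \infty} P(A_j),
\end{aligned}
\]

where we used $P(A_0) = P(\emptyset) = 0$. This proves the continuity from below.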
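Thus it remains to prove (7). To this end choose $A_j \in \mathcal{A}$ with $A_1 \supseteq A_2 \supseteq \cdots$. Then
the complementary sets satisfy $A_1^c \subseteq A_2^c \subseteq \cdots$. The continuity from below lets us
conclude that

\[ P\Big(\bigcup_{j=1}^{\infty} A_j^c\Big) = \lim_{j \to \infty} P(A_j^c) = \lim_{j \to \infty} \big[ 1 - P(A_j) \big] = 1 - \lim_{j \to \infty} P(A_j). \tag{1.5} \]

But

\[ P\Big(\bigcup_{j=1}^{\infty} A_j^c\Big) = 1 - P\Big(\Big(\bigcup_{j=1}^{\infty} A_j^c\Big)^c\Big) = 1 - P\Big(\bigcap_{j=1}^{\infty} A_j\Big), \]

and plugging this into eq. (1.5) gives

\[ P\Big(\bigcap_{j=1}^{\infty} A_j\Big) = \lim_{j \to \infty} P(A_j) \]

as asserted. □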


Remark 1.2.2. Property (2) becomes false without the assumption $A \subseteq B$. But since
$B \setminus A = B \setminus (A \cap B)$ and $A \cap B \subseteq B$, we always have

\[ P(B \setminus A) = P(B) - P(A \cap B). \tag{1.6} \]

Another useful property of probability measures is as follows.


Proposition 1.2.3. Let $(\Omega, \mathcal{A}, P)$ be a probability space. Then for all $A_1, A_2 \in \mathcal{A}$ it follows

\[ P(A_1 \cup A_2) = P(A_1) + P(A_2) - P(A_1 \cap A_2). \tag{1.7} \]

Proof: Write the union of the two sets as

\[ A_1 \cup A_2 = A_1 \cup \big[ A_2 \setminus (A_1 \cap A_2) \big] \]

and note that the two sets on the right-hand side are disjoint. Because of $A_1 \cap A_2 \subseteq A_2$
property (2) of Proposition 1.2.1 applies and leads to

\[ P(A_1 \cup A_2) = P(A_1) + P\big(A_2 \setminus (A_1 \cap A_2)\big) = P(A_1) + \big[ P(A_2) - P(A_1 \cap A_2) \big]. \]

This completes the proof. □
Given $A_1, A_2, A_3 \in \mathcal{A}$ an application of the previous proposition to $A_1$ and $A_2 \cup A_3$
implies

\[ P(A_1 \cup A_2 \cup A_3) = P(A_1) + P(A_2 \cup A_3) - P\big((A_1 \cap A_2) \cup (A_1 \cap A_3)\big). \]

Another application of eq. (1.7) to the second and to the third term in the right-hand
sum proves the following result.


Proposition 1.2.4. Let $(\Omega, \mathcal{A}, P)$ be a probability space and let $A_1$, $A_2$, and $A_3$ be in $\mathcal{A}$.
Then

\[
\begin{aligned}
P(A_1 \cup A_2 \cup A_3) ={}& P(A_1) + P(A_2) + P(A_3) \\
& - \big[ P(A_1 \cap A_2) + P(A_1 \cap A_3) + P(A_2 \cap A_3) \big] + P(A_1 \cap A_2 \cap A_3).
\end{aligned}
\]


Remark 1.2.5. A generalization of Propositions 1.2.3 and 1.2.4 from 2 or 3 to an arbit- 
rary number of sets can be found in Problem 1.5. It is the so-called inclusion—exclusion 
formula. 
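
The formulas for two and three sets can be checked mechanically on a small example. The sketch below assumes the fair-die model $P(A) = \#(A)/6$ and uses the general alternating-sum form of the inclusion-exclusion formula, which is only referenced, not stated, in the text; the names `prob` and `inclusion_exclusion` are hypothetical.

```python
from fractions import Fraction
from itertools import combinations

OMEGA = frozenset(range(1, 7))


def prob(event):
    """Uniform probability on OMEGA (fair-die model, an assumption here)."""
    return Fraction(len(frozenset(event) & OMEGA), len(OMEGA))


def inclusion_exclusion(events):
    """Alternating sum over all k-fold intersections, k = 1, ..., len(events)."""
    total = Fraction(0)
    for k in range(1, len(events) + 1):
        sign = (-1) ** (k + 1)
        for group in combinations(events, k):
            total += sign * prob(frozenset.intersection(*group))
    return total


events = [frozenset({1, 2, 3}), frozenset({2, 3, 4}), frozenset({3, 4, 5})]
union = frozenset().union(*events)
lhs = prob(union)                       # P(A1 ∪ A2 ∪ A3)
rhs = inclusion_exclusion(events)       # right-hand side of Proposition 1.2.4
print(lhs, rhs)                         # 5/6 5/6
print(lhs <= sum(prob(e) for e in events))   # Boole's inequality (1.4): True
```

Here the three single probabilities sum to $9/6$, the pairwise intersections contribute $5/6$, and the triple intersection $\{3\}$ contributes $1/6$, which gives $9/6 - 5/6 + 1/6 = 5/6$.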


First, let us explain an easy example of how the properties of probability measures 
apply. 


Example 1.2.6. Let $(\Omega, \mathcal{A}, P)$ be a probability space. Suppose two events $A$ and $B$ in $\mathcal{A}$
satisfy

$$P(A) = 0.5, \quad P(B) = 0.4 \quad\text{and}\quad P(A \cap B) = 0.2.$$

Which probabilities do $A \cup B$, $A \setminus B$, $A^c \cup B^c$, and $A^c \cap B$ possess?

Answer: An application of Proposition 1.2.3 gives

$$P(A \cup B) = P(A) + P(B) - P(A \cap B) = 0.5 + 0.4 - 0.2 = 0.7.$$

Furthermore, by eq. (1.6) it follows that

$$P(A \setminus B) = P(A) - P(A \cap B) = 0.5 - 0.2 = 0.3.$$

Finally, by De Morgan's rules and another application of eq. (1.6) we get

$$P(A^c \cup B^c) = 1 - P(A \cap B) = 0.8 \quad\text{and}\quad P(A^c \cap B) = P(B \setminus A) = P(B) - P(A \cap B) = 0.2.$$

In summary, say one has to take two exams A and B. The probability of passing exam A
is 0.5, the probability of passing B equals 0.4, and to pass both is 0.2. Then with
probability 0.7 one passes at least one of the exams, with 0.3 exam A, but not B, with 0.8
one fails at least once, and, finally, the probability to pass B but not A is 0.2.
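
The arithmetic of Example 1.2.6 can be replayed in a few lines. The following Python sketch is only an illustration of eqs. (1.6) and (1.7) together with De Morgan's rules; the variable names are chosen here and do not appear in the text.

```python
# Data of Example 1.2.6: P(A) = 0.5, P(B) = 0.4, P(A ∩ B) = 0.2.
p_a, p_b, p_ab = 0.5, 0.4, 0.2

p_union = p_a + p_b - p_ab   # eq. (1.7): P(A ∪ B)
p_a_not_b = p_a - p_ab       # eq. (1.6): P(A \ B)
p_fail_one = 1 - p_ab        # De Morgan: P(A^c ∪ B^c) = 1 - P(A ∩ B)
p_b_not_a = p_b - p_ab       # eq. (1.6): P(A^c ∩ B) = P(B \ A)

for name, value in [("P(A ∪ B)", p_union), ("P(A \\ B)", p_a_not_b),
                    ("P(A^c ∪ B^c)", p_fail_one), ("P(A^c ∩ B)", p_b_not_a)]:
    print(name, round(value, 10))
```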
1.3 Discrete Probability Measures

We start with the investigation of finite sample spaces. They describe random experiments
where only finitely many different results may occur, as, for example, rolling a
die n times, tossing a coin finitely often, and so on. Suppose the sample space contains
N different elements. Then we may enumerate these elements as follows:

$$\Omega = \{\omega_1, \ldots, \omega_N\}.$$

As σ-field we choose $\mathcal{A} = \mathcal{P}(\Omega)$.
Given an arbitrary probability measure $P : \mathcal{P}(\Omega) \to \mathbb{R}$ set

$$p_j := P(\{\omega_j\}), \quad j = 1, \ldots, N. \tag{1.8}$$

In this way we assign to each probability measure $P$ numbers $p_1, \ldots, p_N$. Which properties
do they possess? The following proposition answers this question.


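Proposition 1.3.1. If $P$ is a probability measure on $\mathcal{P}(\Omega)$, then the numbers $p_j$ defined by
eq. (1.8) satisfy

$$0 \le p_j \le 1 \quad\text{and}\quad \sum_{j=1}^{N} p_j = 1. \tag{1.9}$$

Proof: The first property is an immediate consequence of $0 \le P(A) \le 1$ for all $A \subseteq \Omega$.
The second property of the $p_j$s follows by

$$1 = P(\Omega) = P\Big(\bigcup_{j=1}^{N} \{\omega_j\}\Big) = \sum_{j=1}^{N} P(\{\omega_j\}) = \sum_{j=1}^{N} p_j. \qquad \square$$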


Conclusion: Each probability measure $P$ generates a sequence $(p_j)_{j=1}^{N}$ of real numbers
satisfying the properties (1.9). Moreover, if $A \subseteq \Omega$, then we have

$$P(A) = \sum_{\{j :\, \omega_j \in A\}} p_j. \tag{1.10}$$

In particular, the assignment $P \to (p_j)_{j=1}^{N}$ is one-to-one.

Property (1.10) is an easy consequence of $A = \bigcup_{\{j :\, \omega_j \in A\}} \{\omega_j\}$. Furthermore, it tells us
that $P$ is uniquely determined by the $p_j$s. Note that two probability measures $P_1$ and $P_2$
on $(\Omega, \mathcal{A})$ coincide if $P_1(A) = P_2(A)$ for all $A \in \mathcal{A}$.

Now let us look at the reverse question. Suppose we are given an arbitrary sequence
$(p_j)_{j=1}^{N}$ of real numbers satisfying the conditions (1.9).
Proposition 1.3.2. Define $P$ on $\mathcal{P}(\Omega)$ by

$$P(A) = \sum_{\{j :\, \omega_j \in A\}} p_j. \tag{1.11}$$

Then $P$ is a probability measure satisfying $P(\{\omega_j\}) = p_j$ for all $j \le N$.

Proof: $P$ has values in $[0, 1]$ and $P(\Omega) = 1$ by $\sum_{j=1}^{N} p_j = 1$. Since the summation over
the empty set equals zero, $P(\emptyset) = 0$.

Thus it remains to be shown that $P$ is σ-additive. Take disjoint subsets $A_1, A_2, \ldots$
of $\Omega$. Since $\Omega$ is finite, at most finitely many of the $A_j$s are nonempty. Say, for
simplicity, these are the first $n$ sets $A_1, \ldots, A_n$. Then we get

$$P\Big(\bigcup_{k=1}^{\infty} A_k\Big) = P\Big(\bigcup_{k=1}^{n} A_k\Big) = \sum_{\{j :\, \omega_j \in \bigcup_{k=1}^{n} A_k\}} p_j = \sum_{k=1}^{n} \sum_{\{j :\, \omega_j \in A_k\}} p_j = \sum_{k=1}^{\infty} \sum_{\{j :\, \omega_j \in A_k\}} p_j = \sum_{k=1}^{\infty} P(A_k),$$

hence $P$ is σ-additive.
By the construction $P(\{\omega_j\}) = p_j$, which completes the proof. □


Summary: If $\Omega = \{\omega_1, \ldots, \omega_N\}$, then probability measures $P$ on $\mathcal{P}(\Omega)$ can be
identified with sequences $(p_j)_{j=1}^{N}$ satisfying conditions (1.9).

$$\{\text{Probability measures } P \text{ on } \mathcal{P}(\Omega)\} \iff \{\text{Sequences } (p_j)_{j=1}^{N} \text{ with (1.9)}\}$$

Hereby the assignment from the left- to the right-hand side goes via $p_j = P(\{\omega_j\})$ while
in the other direction $P$ is given by eq. (1.11).


Example 1.3.3. Assume $\Omega = \{1, 2, 3\}$. Then each probability measure $P$ on $\mathcal{P}(\Omega)$ is
uniquely determined by the three numbers $p_1 = P(\{1\})$, $p_2 = P(\{2\})$, and $p_3 = P(\{3\})$.
These numbers satisfy $p_1, p_2, p_3 \ge 0$ and $p_1 + p_2 + p_3 = 1$. Conversely, any three numbers
$p_1$, $p_2$, and $p_3$ with these properties generate a probability measure on $\mathcal{P}(\Omega)$ via (1.11).
For example, if $A = \{1, 3\}$, then $P(A) = p_1 + p_3$.
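
The correspondence between weights and probability measures on a finite sample space can be sketched in code. The function `make_measure` below is a hypothetical helper (not from the text) that checks conditions (1.9) and returns the set function of eq. (1.11); the concrete weights 1/2, 1/3, 1/6 are an arbitrary illustration.

```python
from fractions import Fraction

def make_measure(weights):
    """Return P with P(A) = sum of p_j over the points of A, as in eq. (1.11).

    `weights` maps each sample point to its weight p_j; conditions (1.9)
    are checked with exact rational arithmetic.
    """
    if any(p < 0 for p in weights.values()) or sum(weights.values()) != 1:
        raise ValueError("weights must be non-negative and sum to 1")

    def P(event):
        return sum((weights[w] for w in event if w in weights), Fraction(0))

    return P

# Omega = {1, 2, 3} with illustrative weights p1, p2, p3.
P = make_measure({1: Fraction(1, 2), 2: Fraction(1, 3), 3: Fraction(1, 6)})
print(P({1, 3}))  # p1 + p3
```

Using `Fraction` instead of floating-point numbers keeps the check "weights sum to 1" exact.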


Next we treat countably infinite sample spaces, that is, $\Omega = \{\omega_1, \omega_2, \ldots\}$. Also
here we may take $\mathcal{P}(\Omega)$ as σ-field and, as in the case of finite sample spaces, given a
probability measure $P$ on $\mathcal{P}(\Omega)$, we set

$$p_j := P(\{\omega_j\}), \quad j = 1, 2, \ldots$$
Then $(p_j)_{j=1}^{\infty}$ obeys the following properties:

$$p_j \ge 0 \quad\text{and}\quad \sum_{j=1}^{\infty} p_j = 1. \tag{1.12}$$

The proof is the same as in the finite case. The only difference is that here we have to
use the σ-additivity of $P$ because this time $\Omega = \bigcup_{j=1}^{\infty} \{\omega_j\}$. By the same argument it follows
for $A \subseteq \Omega$ that

$$P(A) = \sum_{\{j \ge 1 :\, \omega_j \in A\}} p_j.$$

Hence, again the $p_j$s determine $P$ completely.
Conversely, let $(p_j)_{j=1}^{\infty}$ be an arbitrary sequence of real numbers with properties (1.12).


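Proposition 1.3.4. The mapping $P$ defined by

$$P(A) = \sum_{\{j \ge 1 :\, \omega_j \in A\}} p_j \tag{1.13}$$

is a probability measure on $\mathcal{P}(\Omega)$ with $P(\{\omega_j\}) = p_j$, $1 \le j < \infty$.

Proof: The proof is analogous to Proposition 1.3.2 with one important exception. In
the case $\#(\Omega) < \infty$ we used that there are at most finitely many disjoint nonempty
subsets. This is no longer valid. Thus a different argument is needed.

Given disjoint subsets $A_1, A_2, \ldots$ in $\Omega$ set

$$I_k = \{j \ge 1 : \omega_j \in A_k\}.$$

Then $I_k \cap I_l = \emptyset$ if $k \ne l$,

$$P(A_k) = \sum_{j \in I_k} p_j \quad\text{and}\quad P\Big(\bigcup_{k=1}^{\infty} A_k\Big) = \sum_{j \in I} p_j,$$

where $I = \bigcup_{k=1}^{\infty} I_k$.
Since $p_j \ge 0$, Remark A.5.6 applies and leads to

$$P\Big(\bigcup_{k=1}^{\infty} A_k\Big) = \sum_{j \in I} p_j = \sum_{k=1}^{\infty} \sum_{j \in I_k} p_j = \sum_{k=1}^{\infty} P(A_k).$$

Thus $P$ is σ-additive.
The equality $P(\{\omega_j\}) = p_j$, $1 \le j < \infty$, is again a direct consequence of the definition
of $P$. □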
Summary: If $\Omega = \{\omega_1, \omega_2, \ldots\}$, then probability measures $P$ on $\mathcal{P}(\Omega)$ can be identified
with (infinite) sequences $(p_j)_{j=1}^{\infty}$ possessing the properties (1.12).

$$\{\text{Probability measures } P \text{ on } \mathcal{P}(\Omega)\} \iff \{\text{Sequences } (p_j)_{j=1}^{\infty} \text{ with (1.12)}\}$$

Again, the assignment from the left-hand to the right-hand side goes via $p_j = P(\{\omega_j\})$
while the other direction rests upon eq. (1.13).


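Example 1.3.5. For $\Omega = \mathbb{N}$ and $j \ge 1$ let $p_j = 2^{-j}$. These $p_j$s satisfy conditions (1.12)
(check this!). The generated probability measure $P$ on $\mathcal{P}(\mathbb{N})$ is then given by

$$P(A) = \sum_{j \in A} \frac{1}{2^j}.$$

For example, if $A = \{2, 4, 6, \ldots\}$ then we get

$$P(A) = \sum_{j \in A} \frac{1}{2^j} = \sum_{k=1}^{\infty} \frac{1}{2^{2k}} = \frac{1}{3}.$$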
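
For the weights $p_j = 2^{-j}$ the total mass is 1 and the even numbers carry mass $\sum_{k \ge 1} 4^{-k} = 1/3$. The following Python sketch computes exact partial sums to illustrate both limits; the helper name `partial_sum` and the cut-off `n = 30` are arbitrary choices made here.

```python
from fractions import Fraction

def partial_sum(indices):
    """Exact sum of 2^(-j) over the given indices j."""
    return sum(Fraction(1, 2**j) for j in indices)

n = 30
total = partial_sum(range(1, n + 1))         # partial sum for P(N) = 1
evens = partial_sum(range(2, 2 * n + 1, 2))  # partial sum for P({2, 4, 6, ...}) = 1/3

# The gaps to the limits are 2^(-n) and 4^(-n)/3, respectively.
print(float(total), float(evens))
print(1 - total, Fraction(1, 3) - evens)
```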


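Example 1.3.6. Let $\Omega = \mathbb{Z} \setminus \{0\}$, that is, $\Omega = \{1, -1, 2, -2, \ldots\}$. With $c > 0$ specified later
on assume

$$p_k = \frac{c}{k^2}, \quad k \in \Omega.$$

The number $c > 0$ has to be chosen so that the conditions (1.12) are satisfied, hence it
has to satisfy

$$1 = c \sum_{k \in \mathbb{Z} \setminus \{0\}} \frac{1}{k^2} = 2c \sum_{k=1}^{\infty} \frac{1}{k^2}.$$

But as is well known⁵

$$\sum_{k=1}^{\infty} \frac{1}{k^2} = \frac{\pi^2}{6},$$

which implies $c = \frac{3}{\pi^2}$. Thus $P$ on $\mathcal{P}(\Omega)$ is uniquely described by

$$P(\{k\}) = \frac{3}{\pi^2} \cdot \frac{1}{k^2}, \quad k \in \mathbb{Z} \setminus \{0\}.$$

5 We refer to [Mor16], where one can find an easy proof of this fact. The problem to compute the value
of the sum is known as "Basel problem." The first solution was found in 1734 by Leonhard Euler. Note
that $\sum_{k \ge 1} 1/k^2 = \zeta(2)$ with Riemann's ζ-function.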
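For example, if $A = \mathbb{N}$, then

$$P(A) = \frac{3}{\pi^2} \sum_{k=1}^{\infty} \frac{1}{k^2} = \frac{3}{\pi^2} \cdot \frac{\pi^2}{6} = \frac{1}{2}.$$

Or if $A = \{2, 4, 6, \ldots\}$, it follows

$$P(A) = \frac{3}{\pi^2} \sum_{k=1}^{\infty} \frac{1}{(2k)^2} = \frac{1}{4} P(\mathbb{N}) = \frac{1}{8}.$$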
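
With $c = 3/\pi^2$ the positive integers carry mass 1/2 and the positive even integers carry a quarter of that, namely 1/8. A numerical sketch in Python makes this plausible by truncating the series; the truncation point `K` is an arbitrary choice, and the function `P` only handles finite index ranges.

```python
import math

c = 3 / math.pi**2  # normalizing constant, from the Basel sum pi^2 / 6

def P(event):
    """P(A) for a finite collection A of nonzero integers."""
    return sum(c / k**2 for k in event)

K = 100_000
p_naturals = P(range(1, K + 1))      # truncation of P(N), limit 1/2
p_evens = P(range(2, 2 * K + 1, 2))  # truncation of P({2, 4, ...}), limit 1/8
print(p_naturals, p_evens)
```

The truncation error is of order $c/K$, so the printed values agree with 1/2 and 1/8 only up to roughly five decimal places.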


For later purposes we want to combine the two cases of finite and countably infinite 
sample spaces and thereby introduce a slight generalization. 

Let $\Omega$ be an arbitrary sample space. A probability measure $P$ is said to be discrete
if there is an at most countably infinite set $D \subseteq \Omega$ (i.e., either $D$ is finite or countably
infinite) such that $P(D) = 1$. Then for $A \subseteq \Omega$ it follows

$$P(A) = P(A \cap D) = \sum_{\omega \in A \cap D} P(\{\omega\}).$$

Since $P(D^c) = 0$, this says that $P$ is concentrated on $D$. Of course, all previous results
for finite or countably infinite sample spaces carry over to this more general setting.

Discrete probability measures $P$ are concentrated on an at most countably infinite set
$D$. They are uniquely determined by the values $P(\{\omega\})$, where $\omega \in D$.

Of course, if the sample space is either finite or countably infinite, then all probability
measures on this space are discrete. Nondiscrete probability measures will be
introduced and investigated in Section 1.5.


Example 1.3.7. We once more model the one-time rolling of a die, but now we take as
sample space $\Omega = \mathbb{R}$. Define $P(\{\omega\}) = \frac{1}{6}$ if $\omega = 1, \ldots, 6$ and $P(\{\omega\}) = 0$ otherwise. If
$D = \{1, \ldots, 6\}$, then $P(D) = 1$, hence $P$ is discrete. Given $A \subseteq \mathbb{R}$, it follows

$$P(A) = \frac{\#(A \cap D)}{6}.$$

For example, we have $P([-2, 2]) = \frac{1}{3}$ or $P([3, \infty)) = \frac{2}{3}$.
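
Since only the points of $D = \{1, \ldots, 6\}$ carry mass, $P(A)$ is found by counting the points of $D$ that lie in $A$. A minimal Python sketch, in which a subset of the real line is represented by its indicator function (an encoding chosen here for illustration):

```python
from fractions import Fraction

D = range(1, 7)  # the support {1, ..., 6} of the die measure

def P(indicator):
    """P(A) = #(A ∩ D) / 6, where A is given by its indicator function."""
    return Fraction(sum(1 for w in D if indicator(w)), 6)

print(P(lambda w: -2 <= w <= 2))  # the interval [-2, 2]
print(P(lambda w: w >= 3))        # the interval [3, oo)
```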
1.4 Special Discrete Probability Measures
1.4.1 Dirac Measure

The simplest discrete probability measure is the one concentrated at a single point.
That is, there exists an $\omega_0 \in \Omega$ such that $P(\{\omega_0\}) = 1$. This probability measure is
denoted by $\delta_{\omega_0}$. Consequently, for each $A \in \mathcal{P}(\Omega)$ one has

$$\delta_{\omega_0}(A) = \begin{cases} 1 & : \omega_0 \in A \\ 0 & : \omega_0 \notin A. \end{cases} \tag{1.14}$$

Definition 1.4.1. The probability measure $\delta_{\omega_0}$ defined by eq. (1.14) is called Dirac
measure or point measure at $\omega_0$.


Which random experiment does $(\Omega, \mathcal{P}(\Omega), \delta_{\omega_0})$ model? It describes the experiment
where with probability one the value $\omega_0$ occurs. Thus, in fact it is a deterministic
experiment, not random.

Dirac measures are useful tools to represent general discrete probability measures.
Assume $P$ is concentrated on $D = \{\omega_1, \omega_2, \ldots\}$ and let $p_j = P(\{\omega_j\})$. Then we may
write

$$P = \sum_{j=1}^{\infty} p_j \delta_{\omega_j}. \tag{1.15}$$

Conversely, if a measure $P$ is represented as in eq. (1.15) with certain $\omega_j \in \Omega$ and
numbers $p_j \ge 0$, $\sum_{j=1}^{\infty} p_j = 1$, then $P$ is discrete with $P(D) = 1$, where $D = \{\omega_1, \omega_2, \ldots\}$.
1.4.2 Uniform Distribution on a Finite Set 


The sample space is finite, say $\Omega = \{\omega_1, \ldots, \omega_N\}$, and we assume that all elementary
events are equally likely, that is,

$$P(\{\omega_1\}) = \cdots = P(\{\omega_N\}).$$

A typical example is a fair die, where $\Omega = \{1, \ldots, 6\}$.

Since $1 = P(\Omega) = \sum_{j=1}^{N} P(\{\omega_j\})$ we immediately get $P(\{\omega_j\}) = 1/N$ for all $j \le N$. If
$A \subseteq \Omega$, an application of eq. (1.11) leads to

$$P(A) = \frac{\#(A)}{N} = \frac{\#(A)}{\#(\Omega)}. \tag{1.16}$$

Definition 1.4.2. The probability measure $P$ defined by eq. (1.16) is called uniform
distribution or Laplace distribution on the finite set $\Omega$.

The following formula may be helpful as a memory aid:

$$P(A) = \frac{\text{Number of cases favorable for } A}{\text{Number of possible cases}}$$


Example 1.4.3. In a lottery, 6 numbers are chosen out of 49 and each number appears
only once. What is the probability that the chosen numbers are exactly the six on
my lottery coupon?

Answer: Let us give two different approaches to answer this question.
Approach 1: We record the chosen numbers in the order they show up. As a sample
space we may take

$$\Omega := \{(\omega_1, \ldots, \omega_6) : \omega_i \in \{1, \ldots, 49\},\ \omega_i \ne \omega_j \text{ if } i \ne j\}.$$

Then the number of possible cases is

$$\#(\Omega) = 49 \cdot 48 \cdot 47 \cdot 46 \cdot 45 \cdot 44 = \frac{49!}{43!}.$$

Let $A$ be the event that the numbers on my lottery coupon appear. Which cardinality
does $A$ possess?

Say, for simplicity, on our coupon are the numbers 1, 2, ..., 6. Then it is favorable
for $A$ if these numbers appear in this order. But it is also favorable if (2, 1, 3, ..., 6)
shows up, that is, any permutation of 1, ..., 6 is favorable. Hence $\#(A) = 6!$, which
leads to⁶

$$P(A) = \frac{6!}{49 \cdot 48 \cdot 47 \cdot 46 \cdot 45 \cdot 44} = \frac{1}{13{,}983{,}816} = 7.15112 \times 10^{-8}.$$

Approach 2: We assume that the chosen numbers are already ordered by their size (as
they are published in a newspaper). In this case our sample space is

$$\Omega := \{(\omega_1, \ldots, \omega_6) : 1 \le \omega_1 < \cdots < \omega_6 \le 49\}$$

6 To get an impression about the size of this number assume we buy lottery coupons with all possible
choices of the six numbers. If each coupon is 0.5 mm thick, then all coupons together have a size of
6.992 km, which is about 4.3 miles. And in this row of 4.3 miles there exists exactly one coupon with
the six numbers chosen in the lottery.
and now

$$\#(\Omega) = \binom{49}{6}.$$

Why? Any set of six different numbers may be written in exactly one way in increasing
order and thus, to choose six ordered numbers is exactly the same as to
choose a (nonordered) set of six numbers. And there are $\binom{49}{6}$ possibilities to choose
six numbers. In this setting we have $\#(A) = 1$, thus also here we get

$$P(A) = \frac{1}{\binom{49}{6}}.$$

Example 1.4.4. A fair coin is labeled with "0" and "1." Toss it $n$ times and record the
sequence of 0s and 1s in the order of their appearance. Thus,


$$\Omega := \{0, 1\}^n = \{(\omega_1, \ldots, \omega_n) : \omega_i \in \{0, 1\}\},$$

and $\#(\Omega) = 2^n$. The coin is assumed to be fair, hence each sequence of 0s and 1s is
equally likely. Therefore, whenever $A \subseteq \Omega$, then

$$P(A) = \frac{\#(A)}{2^n}.$$

Take, for example, the event $A$ where for some fixed $i \le n$ the $i$th toss equals "0,"
that is,

$$A = \{(\omega_1, \ldots, \omega_n) : \omega_i = 0\}.$$

Then $\#(A) = 2^{n-1}$ leads to the (not surprising) result

$$P(A) = \frac{2^{n-1}}{2^n} = \frac{1}{2}.$$

Or let $A$ occur if we observe for some given $k \le n$ exactly $k$ times the number "1." Then
$\#(A) = \binom{n}{k}$ and we get

$$P(A) = \binom{n}{k} \frac{1}{2^n}.$$
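
The counting arguments of Examples 1.4.3 and 1.4.4 are easy to confirm by machine. The sketch below relies on Python's `math.perm` and `math.comb` (available from Python 3.8 on); the small values `n = 5`, `k = 2` for the coin are arbitrary illustrative choices.

```python
import math
from fractions import Fraction
from itertools import product

# Example 1.4.3, approach 1: ordered draws, 6! favorable out of 49*48*...*44.
p_ordered = Fraction(math.factorial(6), math.perm(49, 6))
# Approach 2: increasingly ordered draws, 1 favorable out of C(49, 6).
p_sorted = Fraction(1, math.comb(49, 6))
print(p_ordered == p_sorted, float(p_sorted))

# Example 1.4.4: count the 0-1 sequences of length n with exactly k ones
# by brute force and compare with the binomial coefficient.
n, k = 5, 2
count = sum(1 for w in product((0, 1), repeat=n) if sum(w) == k)
print(count, math.comb(n, k), Fraction(count, 2**n))
```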


Example 1.4.5. We have k particles that we distribute randomly into n boxes. All pos- 
sible distributions of the particles are assumed to be equally likely. How do we get P(A) 
for a given event A ? 

Answer: In this formulation the question is not asked correctly because we did not 
fix when two distributions of particles coincide. 
Let us illustrate this problem in the case of two particles and two boxes. If the
particles are not distinguishable (anonymous) then there are three different ways 
to distribute the particles into the two boxes. Thus, assuming that all distributions are 
equally likely, each elementary event has probability 1/3. 

On the other hand, if the particles are distinguishable, that is, they carry names, 
then there exist four different ways of distributing them (check this!), hence each 
elementary event has probability 1/4. 

Let us answer the above question in the two cases (distinguishable and anonym- 
ous) separately. 

Distinguishable particles: Here we may enumerate the particles from 1 to $k$ and
each distribution of particles is uniquely described by a sequence $(a_1, \ldots, a_k)$, where
$a_j \in \{1, \ldots, n\}$. For example, $a_1 = 3$ means that particle one is in box 3. Hence, a
suitable sample space is

$$\Omega = \{(a_1, \ldots, a_k) : 1 \le a_j \le n\}.$$

Since $\#(\Omega) = n^k$, for events $A \subseteq \Omega$ it follows

$$P(A) = \frac{\#(A)}{n^k}.$$

Anonymous particles: We record how many of the $k$ particles are in box 1, how many
are in box 2 up to box $n$. Thus as sample space we may choose

$$\Omega = \{(k_1, \ldots, k_n) : k_j \in \{0, \ldots, k\},\ k_1 + \cdots + k_n = k\}.$$

The sequence $(k_1, \ldots, k_n)$ occurs if box 1 contains $k_1$ particles, box 2 contains $k_2$, and
so on. From the results in case 3 of Section A.3.2 we derive

$$\#(\Omega) = \binom{n+k-1}{k} = \frac{(n+k-1)!}{k!\,(n-1)!}.$$

Hence, if $A \subseteq \Omega$, then

$$P(A) = \frac{\#(A)}{\#(\Omega)} = \#(A)\, \frac{k!\,(n-1)!}{(n+k-1)!}.$$

Summary: If we distribute $k$ particles and assume that all partitions are equally
likely⁷, then in the case of distinguishable or of anonymous particles

$$P(A) = \frac{\#(A)}{n^k} \quad\text{or}\quad P(A) = \#(A)\, \frac{k!\,(n-1)!}{(n+k-1)!},$$

respectively.

7 Compare Example 1.4.16 and the following remark.
Let us evaluate $P(A)$ for some concrete event $A$ in both cases. Suppose $k \le n$ and select
$k$ of the $n$ boxes. Set

$$A := \{\text{In each of the chosen } k \text{ boxes is exactly one particle}\}. \tag{1.17}$$

To simplify the notation assume that the first $k$ boxes have been chosen. The general
case is treated in a similar way. Then in the "distinguishable case" the event $A$ occurs
if and only if for some permutation $\pi \in S_k$ the sequence $(\pi(1), \ldots, \pi(k))$
appears. Thus $\#(A) = k!$ and

$$P(A) = \frac{k!}{n^k}. \tag{1.18}$$

In the "anonymous case" it follows $\#(A) = 1$ (why?). Hence here we obtain

$$P(A) = \frac{k!\,(n-1)!}{(n+k-1)!}. \tag{1.19}$$

Additional question: For $k \le n$ define $B$ by

$$B := \{\text{Each of the } n \text{ boxes contains at most 1 particle}\}.$$

Find $P(B)$ in both cases.

Answer: The event $B$ is the (disjoint) union of the following events: the $k$ particles
are distributed in a given collection of $k$ boxes. The probability of this event was
calculated in eqs. (1.18) and (1.19), respectively. Since there are $\binom{n}{k}$ possibilities to choose
$k$ boxes of the $n$ we get $P(B) = \binom{n}{k} P(A)$ with $A$ as defined by (1.17), that is,

$$P(B) = \binom{n}{k} \frac{k!}{n^k} = \frac{n!}{(n-k)!\, n^k}$$

and

$$P(B) = \binom{n}{k} \frac{k!\,(n-1)!}{(n+k-1)!} = \frac{n!\,(n-1)!}{(n-k)!\,(n+k-1)!},$$

respectively.
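
Both formulas for $P(B)$ can be cross-checked by brute-force enumeration for small $n$ and $k$. In the Python sketch below, anonymous distributions are encoded as multisets of box numbers, which is equivalent to occupancy vectors $(k_1, \ldots, k_n)$; the function name and the choice $n = 4$, $k = 3$ are assumptions made for illustration.

```python
from fractions import Fraction
from itertools import product, combinations_with_replacement
from math import factorial

def prob_at_most_one(n, k):
    """P(B) for B = 'no box holds two or more particles', in both models."""
    # Distinguishable: outcomes are sequences (a_1, ..., a_k) of box numbers.
    seqs = list(product(range(1, n + 1), repeat=k))
    dist = Fraction(sum(1 for a in seqs if len(set(a)) == k), len(seqs))
    # Anonymous: outcomes are multisets of box numbers, all equally likely.
    multis = list(combinations_with_replacement(range(1, n + 1), k))
    anon = Fraction(sum(1 for a in multis if len(set(a)) == k), len(multis))
    return dist, anon

n, k = 4, 3
dist, anon = prob_at_most_one(n, k)
print(dist, Fraction(factorial(n), factorial(n - k) * n**k))
print(anon, Fraction(factorial(n) * factorial(n - 1),
                     factorial(n - k) * factorial(n + k - 1)))
```

For two particles and two boxes the same function returns 1/2 and 1/3, in line with the four and three equally likely outcomes discussed above.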


1.4.3 Binomial Distribution 


The sample space is $\Omega = \{0, 1, \ldots, n\}$ for some $n \ge 1$ and $p$ is a real number with
$0 \le p \le 1$.

Proposition 1.4.6. There exists a unique probability measure $B_{n,p}$ on $\mathcal{P}(\Omega)$ satisfying

$$B_{n,p}(\{k\}) = \binom{n}{k} p^k (1-p)^{n-k}, \quad k = 0, \ldots, n. \tag{1.20}$$
Proof: In order to use Proposition 1.3.2 we have to verify $B_{n,p}(\{k\}) \ge 0$ and
$\sum_{k=0}^{n} B_{n,p}(\{k\}) = 1$. The first property is obvious because of $0 \le p \le 1$ and $0 \le 1 - p \le 1$.
To prove the second one we apply the binomial theorem (Proposition A.3.7) with $a = p$
and with $b = 1 - p$. This leads to

$$\sum_{k=0}^{n} B_{n,p}(\{k\}) = \sum_{k=0}^{n} \binom{n}{k} p^k (1-p)^{n-k} = (p + (1-p))^n = 1.$$

Hence the assertion follows by Proposition 1.3.2 with $p_k = B_{n,p}(\{k\})$, $k = 0, \ldots, n$. □


Definition 1.4.7. The probability measure $B_{n,p}$ defined by eq. (1.20) is called
binomial distribution with parameters $n$ and $p$.

Remark 1.4.8. Observe that $B_{n,p}$ acts as follows. If $A \subseteq \{0, \ldots, n\}$, then

$$B_{n,p}(A) = \sum_{k \in A} \binom{n}{k} p^k (1-p)^{n-k}.$$

Furthermore, for $p = 1/2$ we get

$$B_{n,1/2}(\{k\}) = \binom{n}{k} \frac{1}{2^n}.$$

As we saw in Example 1.4.4 this probability describes the k-fold occurrence of "1" when
tossing a fair coin $n$ times.


Which random experiment does the binomial distribution describe? To answer this question
let us first look at the case $n = 1$. Here we have $\Omega = \{0, 1\}$ with

$$B_{1,p}(\{0\}) = 1 - p \quad\text{and}\quad B_{1,p}(\{1\}) = p.$$

If we identify "0" with failure and "1" with success, then the binomial distribution
describes an experiment where either success or failure may occur, and the success
probability is $p$. Now we execute the same experiment $n$ times and every time we may
observe either failure or success. If we have $k$ times success, then there are $\binom{n}{k}$ ways
to obtain these $k$ successes during the $n$ trials. The probability for success is $p$ and for
failure $1 - p$. By the independence of the single trials, the probability for each such sequence
is $p^k (1-p)^{n-k}$. By multiplying this probability with the number of different positions
of successes we finally arrive at $\binom{n}{k} p^k (1-p)^{n-k}$, the value of $B_{n,p}(\{k\})$.

Summary: The binomial distribution describes the following experiment. We execute
$n$ times independently the same experiment where each time either success or failure
may appear. The success probability is $p$. Then $B_{n,p}(\{k\})$ is the probability to observe
exactly $k$ times success or, equivalently, $n - k$ times failure.
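
Formula (1.20) translates directly into code. The following Python sketch evaluates the binomial weights and checks numerically that they add up to one; the function name `binom_pmf` and the parameters `n = 10`, `p = 0.3` are illustrative choices, not taken from the text.

```python
from math import comb

def binom_pmf(n, p, k):
    """B_{n,p}({k}) as in eq. (1.20)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 10, 0.3
pmf = [binom_pmf(n, p, k) for k in range(n + 1)]
print(sum(pmf))  # equals 1 up to rounding, by the binomial theorem

# For p = 1/2 the weights reduce to C(n, k) / 2^n (fair coin, Example 1.4.4).
print(binom_pmf(4, 0.5, 2), comb(4, 2) / 2**4)
```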


Example 1.4.9. An exam consists of 100 problems where each of the questions may be
answered either with "yes" or "no." To pass the exam at least 60 questions have to be
answered correctly. Let $p$ be the probability to answer a single question correctly. How
big does $p$ have to be in order to pass the exam with a probability greater than 75%?

Answer: The number $p$ has to be chosen such that the following estimate is
satisfied:

$$\sum_{k=60}^{100} \binom{100}{k} p^k (1-p)^{100-k} \ge 0.75.$$

Numerical calculations show that this is valid if and only if $p \ge 0.62739$.
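
One way to carry out such a numerical calculation is bisection, using that the probability of passing increases with $p$. The Python sketch below is one possible approach, not necessarily the one used for the value quoted above; the function names, the tolerance, and the default arguments are assumptions made here.

```python
from math import comb

def pass_probability(p, n=100, need=60):
    """Probability of at least `need` correct answers among n questions."""
    return sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(need, n + 1))

def smallest_p(target=0.75, tol=1e-9):
    """Bisection on [0, 1]; relies on pass_probability being increasing in p."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if pass_probability(mid) >= target:
            hi = mid   # mid already suffices, search further left
        else:
            lo = mid   # mid is too small
    return hi

p_min = smallest_p()
print(round(p_min, 5))
```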


Example 1.4.10. In an auditorium there are $N$ students. Find the probability that at
least two of them have their birthday on April 1.

Answer: We do not take leap years into account and assume that there are no twins
among the students. Finally, we make the (probably unrealistic) assumption that all
days of a year are equally likely as birthdays. Say success occurs if a student has birthday
on April 1. Under the above assumptions the success probability is 1/365. Hence
the number of students having birthday on April 1 is binomially distributed with parameters
$N$ and $p = 1/365$. We ask for the probability of $A = \{2, 3, \ldots, N\}$. This may be
evaluated by

$$\sum_{k=2}^{N} \binom{N}{k} \Big(\frac{1}{365}\Big)^k \Big(\frac{364}{365}\Big)^{N-k} = 1 - B_{N,1/365}(\{0\}) - B_{N,1/365}(\{1\}) = 1 - \Big(\frac{364}{365}\Big)^N - \frac{N}{365} \Big(\frac{364}{365}\Big)^{N-1}.$$

For example, for $N = 500$ this probability is approximately 0.397895.


1.4.4 Multinomial Distribution 


Given natural numbers $n$ and $m$, the sample space for the multinomial distribution is⁸

$$\Omega := \{(k_1, \ldots, k_m) \in \mathbb{N}_0^m : k_1 + \cdots + k_m = n\}.$$

With certain non-negative real numbers $p_1, \ldots, p_m$ satisfying $p_1 + \cdots + p_m = 1$ set

$$P(\{(k_1, \ldots, k_m)\}) := \binom{n}{k_1, \ldots, k_m} p_1^{k_1} \cdots p_m^{k_m}, \quad (k_1, \ldots, k_m) \in \Omega. \tag{1.21}$$

8 By case 3 in Section A.3.2 the cardinality of $\Omega$ is $\binom{n+m-1}{n}$.
Recall that the multinomial coefficients appearing in eq. (1.21) were defined in
eq. (A.15) as

$$\binom{n}{k_1, \ldots, k_m} := \frac{n!}{k_1! \cdots k_m!}.$$

The next result shows that eq. (1.21) defines a probability measure.

Proposition 1.4.11. There is a unique probability measure $P$ on $\mathcal{P}(\Omega)$ such that (1.21)
holds for all $(k_1, \ldots, k_m) \in \Omega$.


Proof: An application of the multinomial theorem (Proposition A.3.18) implies

$$\sum_{(k_1, \ldots, k_m) \in \Omega} P(\{(k_1, \ldots, k_m)\}) = \sum_{\substack{k_1 + \cdots + k_m = n \\ k_j \ge 0}} \binom{n}{k_1, \ldots, k_m} p_1^{k_1} \cdots p_m^{k_m} = (p_1 + \cdots + p_m)^n = 1^n = 1.$$

Since $P(\{(k_1, \ldots, k_m)\}) \ge 0$ the assertion follows by Proposition 1.3.2. □
In view of the preceding proposition, the following definition is justified.

Definition 1.4.12. The probability measure $P$ defined by eq. (1.21) is called multinomial
distribution with parameters $n$, $m$, and $p_1, \ldots, p_m$.


Remark 1.4.13. Sometimes it is useful to regard the multinomial distribution on the
larger sample space $\Omega = \mathbb{N}_0^m$. In this case we have to modify eq. (1.21) slightly as
follows:

$$P(\{(k_1, \ldots, k_m)\}) := \begin{cases} \binom{n}{k_1, \ldots, k_m} p_1^{k_1} \cdots p_m^{k_m} & : k_1 + \cdots + k_m = n \\ 0 & : k_1 + \cdots + k_m \ne n. \end{cases}$$


Which random experiment does the multinomial distribution describe? To answer this
question let us recall the model for the binomial distribution. In an urn are balls of two
different colors, say white and red. The proportion of the white balls is $p$, hence $1 - p$
of the red ones. If we choose $n$ balls with replacement, then $B_{n,p}(\{k\})$ is the probability
to observe exactly $k$ white balls.

What happens if in the urn are balls of more than two different colors, say of $m$
ones, and the proportions of the colored balls are $p_1, \ldots, p_m$ with $p_1 + \cdots + p_m = 1$?

As in the model for the binomial distribution we choose $n$ balls with replacement.
Given integers $k_j \ge 0$ one asks now for the probability of the following event: balls
of color 1 showed up $k_1$ times, those of color 2 $k_2$ times, and so on. Of course, this
probability is zero whenever $k_1 + \cdots + k_m \ne n$. But if the sum is $n$, then $p_1^{k_1} \cdots p_m^{k_m}$ is the
probability for $k_j$ balls of color $j$ in some fixed order. There are $\binom{n}{k_1, \ldots, k_m}$ ways to order
the balls without changing the frequency of the colors. Thus the desired probability
equals $\binom{n}{k_1, \ldots, k_m} p_1^{k_1} \cdots p_m^{k_m}$.


Summary: Suppose in an experiment $m$ different results are possible (e.g., $m$ colors)
and assume that each time the $j$th result occurs with probability $p_j$. If we execute the
experiment $n$ times, then the multinomial distribution describes the probability of the
following event: the first result occurs $k_1$ times, the second $k_2$ times, and so on.

Remark 1.4.14. If $m = 2$, then $p_2 = 1 - p_1$ as well as $\binom{n}{k_1, k_2} = \binom{n}{k_1, n - k_1} = \binom{n}{k_1}$.
Consequently, in this case the multinomial distribution coincides with the binomial
distribution $B_{n,p_1}$.


Remark 1.4.15. Suppose that all $m$ possible different outcomes of the experiment are
equally likely, that is, we have

$$p_1 = \cdots = p_m = \frac{1}{m}.$$

Under this assumption it follows

$$P(\{(k_1, \ldots, k_m)\}) = \binom{n}{k_1, \ldots, k_m} \frac{1}{m^n}, \quad (k_1, \ldots, k_m) \in \Omega. \tag{1.22}$$


Example 1.4.16. Suppose we have $m$ boxes $B_1, \ldots, B_m$ and $n$ particles that we place
successively into these boxes. Thereby $p_j$ is the probability to place a single particle
into box $B_j$. What is the probability that after distributing all $n$ particles there are $k_1$
particles in the first box, $k_2$ in the second up to $k_m$ in the last one?

Answer: This probability is given by formula (1.21), that is,

$$P\{k_1 \text{ particles are in } B_1, \ldots, k_m \text{ particles are in } B_m\} = \binom{n}{k_1, \ldots, k_m} p_1^{k_1} \cdots p_m^{k_m}.$$

Suppose now $n \le m$ and that all boxes are chosen with probability $1/m$. Find the
probability that each of the first $n$ boxes $B_1, \ldots, B_n$ contains exactly one particle.
Answer: By eq. (1.22) it follows

$$P(\{(\underbrace{1, \ldots, 1}_{n}, 0, \ldots, 0)\}) = \binom{n}{\underbrace{1, \ldots, 1}_{n}, 0, \ldots, 0} \frac{1}{m^n} = \frac{n!}{m^n}. \tag{1.23}$$


Remark 1.4.17. From a different point of view we investigated the last problem already
in Example 1.4.5. But why do we get in eq. (1.23) the same answer as in the case of
distinguishable particles although the $n$ distributed ones are anonymous?

Answer: The crucial point is that we assumed in the anonymous case that all partitions
of the particles are equally likely. And this is not valid when distributing the
particles successively. To see this, assume $n = m = 2$. Then there exist three different
ways to distribute the particles, but they have different probabilities:

$$P(\{(0, 2)\}) = P(\{(2, 0)\}) = \frac{1}{4} \quad\text{while}\quad P(\{(1, 1)\}) = \frac{1}{2}.$$

Thus, although the distributed particles are not distinguishable, they get names due
to the successive distribution (first particle, second particle, etc.).


Example 1.4.18. Six people randomly enter a train with three coaches. Each person
chooses a coach independently of the others and all coaches are equally likely to be
chosen. Find the probability that there are two people in each coach.

Answer: We have $m = 3$, $n = 6$, and $p_1 = p_2 = p_3 = \frac{1}{3}$. Hence the probability we are
looking for is

$$P(\{(2, 2, 2)\}) = \binom{6}{2, 2, 2} \frac{1}{3^6} = \frac{6!}{2!\,2!\,2!} \cdot \frac{1}{3^6} = \frac{10}{3^4} = 0.12345679.$$


Example 1.4.19. In a country 40% of the cars are gray, 20% are black, and 10% are red.
The remaining cars have different colors. Now we observe 10 cars at random. What is
the probability to see two gray cars, four black, and one red?

Answer: By assumption $m = 4$ (gray, black, red, and others), $p_1 = 2/5$, $p_2 = 1/5$,
$p_3 = 1/10$, and $p_4 = 3/10$. Thus the probability of the vector $(2, 4, 1, 3)$ is given by

$$\binom{10}{2, 4, 1, 3} \Big(\frac{2}{5}\Big)^2 \Big(\frac{1}{5}\Big)^4 \Big(\frac{1}{10}\Big)^1 \Big(\frac{3}{10}\Big)^3 = \frac{10!}{2!\,4!\,1!\,3!} \cdot \frac{2^2}{5^2} \cdot \frac{1}{5^4} \cdot \frac{1}{10} \cdot \frac{3^3}{10^3} = 0.00870912.$$
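
Formula (1.21) can be evaluated exactly with rational arithmetic. The following Python sketch recomputes the probabilities of Examples 1.4.18 and 1.4.19; the helper `multinomial_pmf` is a name introduced here for illustration.

```python
from fractions import Fraction
from math import factorial

def multinomial_pmf(ks, ps):
    """P({(k_1, ..., k_m)}) from eq. (1.21) with n = k_1 + ... + k_m."""
    coeff = factorial(sum(ks))
    for k in ks:
        coeff //= factorial(k)  # each intermediate quotient is an integer
    prob = Fraction(coeff)
    for k, p in zip(ks, ps):
        prob *= Fraction(p) ** k
    return prob

# Example 1.4.18: six people, three equally likely coaches.
coaches = multinomial_pmf((2, 2, 2), [Fraction(1, 3)] * 3)
# Example 1.4.19: gray, black, red, other cars among 10 observed.
cars = multinomial_pmf((2, 4, 1, 3),
                       [Fraction(2, 5), Fraction(1, 5), Fraction(1, 10), Fraction(3, 10)])
print(coaches, float(coaches))
print(cars, float(cars))
```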


1.4.5 Poisson Distribution 


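The sample space for this distribution is $\mathbb{N}_0 = \{0, 1, 2, \ldots\}$. Furthermore, $\lambda > 0$ is a
given parameter.

Proposition 1.4.20. There exists a unique probability measure $\mathrm{Pois}_\lambda$ on $\mathcal{P}(\mathbb{N}_0)$ such that

$$\mathrm{Pois}_\lambda(\{k\}) = \frac{\lambda^k}{k!} e^{-\lambda}, \quad k \in \mathbb{N}_0. \tag{1.24}$$

Proof: Because of $e^{-\lambda} > 0$ it follows that $\mathrm{Pois}_\lambda(\{k\}) > 0$. Thus it suffices to verify

$$\sum_{k=0}^{\infty} \frac{\lambda^k}{k!} e^{-\lambda} = 1.$$

But this is a direct consequence of

$$\sum_{k=0}^{\infty} \frac{\lambda^k}{k!} = e^{\lambda}. \qquad \square$$

Definition 1.4.21. The probability measure $\mathrm{Pois}_\lambda$ on $\mathcal{P}(\mathbb{N}_0)$ satisfying eq. (1.24) is
called Poisson distribution with parameter $\lambda > 0$.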

The Poisson distribution describes experiments where the number of trials is big, but 
the single success probability is small. More precisely, the following limit theorem 
holds. 


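Proposition 1.4.22 (Poisson's limit theorem). Let $(p_n)_{n=1}^{\infty}$ be a sequence of numbers
with $0 < p_n \le 1$ and

$$\lim_{n \to \infty} n p_n = \lambda$$

for some $\lambda > 0$. Then for all $k \in \mathbb{N}_0$ it follows

$$\lim_{n \to \infty} B_{n,p_n}(\{k\}) = \mathrm{Pois}_\lambda(\{k\}).$$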


Proof: Write

$$B_{n,p_n}(\{k\}) = \binom{n}{k} p_n^k (1 - p_n)^{n-k} = \frac{1}{k!} \cdot \frac{n(n-1) \cdots (n-k+1)}{n^k} \, (n p_n)^k (1 - p_n)^n (1 - p_n)^{-k}$$

$$= \frac{1}{k!} \Big[ \frac{n}{n} \cdot \frac{n-1}{n} \cdots \frac{n-k+1}{n} \Big] (n p_n)^k (1 - p_n)^n (1 - p_n)^{-k},$$

and investigate the behavior of the different parts of the last equation separately. Each
of the fractions in the bracket tends to 1, hence the whole bracket tends to 1. By assumption
we have $n p_n \to \lambda$, thus $\lim_{n \to \infty} (n p_n)^k = \lambda^k$. Moreover, since $n p_n \to \lambda$ with
$\lambda > 0$ we get $p_n \to 0$, which implies $\lim_{n \to \infty} (1 - p_n)^{-k} = 1$.

Thus, it remains to determine the behavior of $(1 - p_n)^n$ as $n \to \infty$. Proposition A.5.1
asserts that if a sequence of real numbers $(x_n)_{n \ge 1}$ converges to $x \in \mathbb{R}$, then

$$\lim_{n \to \infty} \Big(1 + \frac{x_n}{n}\Big)^n = e^x.$$

Setting $x_n := -n p_n$, by assumption $x_n \to -\lambda$, hence

$$\lim_{n \to \infty} (1 - p_n)^n = \lim_{n \to \infty} \Big(1 + \frac{x_n}{n}\Big)^n = e^{-\lambda}.$$

If we combine all the different parts, then this completes the proof by

$$\lim_{n \to \infty} B_{n,p_n}(\{k\}) = \frac{\lambda^k}{k!} e^{-\lambda} = \mathrm{Pois}_\lambda(\{k\}). \qquad \square$$


The previous theorem allows two conclusions. 
(1) Whenever $n$ is large and $p$ is small, without hesitation one may replace $B_{n,p}$ by
$\mathrm{Pois}_\lambda$, where $\lambda = np$. In this way one avoids the (sometimes) difficult evaluation
of the binomial coefficients.


Example 1.4.23. In Example 1.4.10 we found the probability that among $N$ students
there are at least two having their birthday on April 1. We then used the binomial distribution
with parameters $N$ and $p = 1/365$. Hence the approximating Poisson distribution
has parameter $\lambda = N/365$ and the corresponding probability is given by

$$\mathrm{Pois}_\lambda(\{2, 3, \ldots\}) = 1 - (1 + \lambda) e^{-\lambda} = 1 - \Big(1 + \frac{N}{365}\Big) e^{-N/365}.$$

If again $N = 500$, hence $\lambda = 500/365$, the approximate probability equals 0.397719.
Compare this value with the "precise" probability 0.397895 obtained in Example
1.4.10.


(2) Poisson's limit theorem explains why the Poisson distribution describes experiments
with many trials and small success probability. For example, if we look for
a model for the number of car accidents per year, then the Poisson distribution
is a good choice. There are many cars, but the probability⁹ that a single driver is
involved in an accident is quite small.


Later on we will investigate other examples where the Poisson distribution appears in 
a natural way. 
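
The quality of the Poisson approximation in the birthday problem (Examples 1.4.10 and 1.4.23) can be checked directly. The Python sketch below evaluates both the binomial and the Poisson expression for $N = 500$; the function names are chosen here for illustration.

```python
from math import comb, exp, factorial

def binom_pmf(n, p, k):
    """B_{n,p}({k}) as in eq. (1.20)."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

def poisson_pmf(lam, k):
    """Pois_lambda({k}) as in eq. (1.24)."""
    return lam**k / factorial(k) * exp(-lam)

N = 500
p = 1 / 365
# Probability of at least two April 1 birthdays among N students.
exact = 1 - binom_pmf(N, p, 0) - binom_pmf(N, p, 1)
approx = 1 - poisson_pmf(N * p, 0) - poisson_pmf(N * p, 1)
print(round(exact, 6), round(approx, 6))
```

The two values differ only in the fourth decimal place, which illustrates conclusion (1) for large $n$ and small $p$.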


1.4.6 Hypergeometric Distribution 


Among N delivered machines are M defective. One chooses n of the N machines ran- 
domly and checks them. What is the probability to observe m defective machines in 
the sample of size n? 


9 To call it “success” probability in this case is perhaps not quite appropriate. 
First note that there are $\binom{N}{n}$ ways to choose $n$ machines for checking. In order to
observe $m$ defective ones these have to be taken from the $M$ defective. The remaining
$n - m$ machines are nondefective, hence they must be chosen from the $N - M$ nondefective
ones. There are $\binom{M}{m}$ ways to take the defective machines and $\binom{N-M}{n-m}$ possibilities
for the nondefective ones.

Thus the following approach describes this experiment:

$$H_{N,M,n}(\{m\}) := \frac{\binom{M}{m} \binom{N-M}{n-m}}{\binom{N}{n}}, \quad 0 \le m \le n. \tag{1.25}$$

Recall that in Section A.3.1 we agreed that $\binom{n}{k} = 0$ whenever $k > n$. This turns out
to be useful in the definition of $H_{N,M,n}$. For example, if $m > M$, then the probability to
observe $m$ defective machines is of course zero.
We want to prove now that eq. (1.25) defines a probability measure.


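Proposition 1.4.24. There exists a unique probability measure $H_{N,M,n}$ on the powerset
of $\{0, \ldots, n\}$ satisfying eq. (1.25).

Proof: Vandermonde's identity (cf. Proposition A.3.8) asserts that for all $k$, $m$, and $n$
in $\mathbb{N}_0$

$$\sum_{j=0}^{k} \binom{n}{j} \binom{m}{k-j} = \binom{n+m}{k}. \tag{1.26}$$

Now replace $n$ by $M$, next $m$ by $N - M$, then $k$ by $n$, and, finally, $j$ by $m$. Doing so
eq. (1.26) leads to

$$\sum_{m=0}^{n} \binom{M}{m} \binom{N-M}{n-m} = \binom{N}{n}.$$

But this implies

$$\sum_{m=0}^{n} H_{N,M,n}(\{m\}) = \frac{1}{\binom{N}{n}} \sum_{m=0}^{n} \binom{M}{m} \binom{N-M}{n-m} = \frac{1}{\binom{N}{n}} \cdot \binom{N}{n} = 1.$$

Clearly, $H_{N,M,n}(\{m\}) \ge 0$, which completes the proof by virtue of Proposition 1.3.2. □

Definition 1.4.25. The probability measure $H_{N,M,n}$ defined by eq. (1.25) is called
hypergeometric distribution with parameters $N$, $M$, and $n$.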
Example 1.4.26. A retailer gets a delivery of 100 machines; 10 of them are defective.
He chooses eight machines at random and tests them. Find the probability that two
or more of the tested machines are defective.

Answer: The desired probability is

$$\sum_{m=2}^{8} \frac{\binom{10}{m} \binom{90}{8-m}}{\binom{100}{8}} = 0.18195.$$


Remark 1.4.27. In the daily practice the reversed question is more important. The size 
N of the delivery is known and, of course, also the size of the tested sample. The 
number M of defective machines is unknown. Now suppose we observed m defective 
machines among the n tested. Does this (random) number m lead to some information 
about the number M of defective machines in the delivery? We will investigate this 
problem in Proposition 8.5.15. 


Example 1.4.28. In a pond are 200 fish. One day the owner of the pond catches 20 fish,
marks them, and puts them back into the pond. After a while the owner catches once
more 20 fish. Find the probability that among these fish there is exactly one marked.

Answer: We have $N = 200$, $M = 20$, and $n = 20$. Hence the desired probability is

$$H_{200,20,20}(\{1\}) = \frac{\binom{20}{1} \binom{180}{19}}{\binom{200}{20}} = 0.26967.$$

Remark 1.4.29. The previous example is not very realistic because in general the
number $N$ of fish is unknown. Known are $M$ and $n$, the (random) number $m$ was observed.
Also here one may ask whether the knowledge of $m$ leads to some information
about $N$. This question will be investigated later in Proposition 8.5.17.


Example 1.4.30. In a lottery 6 numbers are chosen randomly out of 49. Suppose we
bought a lottery coupon with six numbers. What is the probability that exactly $k$,
$k = 0, \ldots, 6$, of our numbers appear in the drawing?

Answer: There are $n = 6$ numbers randomly chosen out of $N = 49$. Among the
49 numbers are $M = 6$ “defective” ones. These are the six numbers on our coupon, and we
ask for the probability that $k$ of the “defective” ones are among the chosen six. The question
is answered by the hypergeometric distribution $H_{49,6,6}$, that is, the probability of $k$
correct numbers on our coupon is given by

\[ H_{49,6,6}(\{k\}) = \frac{\binom{6}{k} \binom{43}{6-k}}{\binom{49}{6}} , \qquad k = 0, \ldots, 6 . \]

The numerical values of these probabilities for $k = 0, \ldots, 6$ are

k    Probability
0    0.435965
1    0.413019
2    0.132378
3    0.0176504
4    0.00096862
5    0.0000184499
6    7.15112 · 10^{-8}
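
The table of lottery probabilities can be reproduced with a few lines of code. This is a sketch under the assumption that the probabilities are exactly the hypergeometric values $H_{49,6,6}(\{k\})$; the name `lottery_pmf` is hypothetical.

```python
from math import comb


def lottery_pmf(k: int) -> float:
    """Probability of exactly k correct numbers in a 6-out-of-49 lottery."""
    return comb(6, k) * comb(43, 6 - k) / comb(49, 6)


probs = [lottery_pmf(k) for k in range(7)]

print("k  probability")
for k, pk in enumerate(probs):
    print(f"{k}  {pk:.6g}")
```

Six correct numbers correspond to exactly one of the $\binom{49}{6}$ equally likely drawings, which explains the very small last entry.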


Remark 1.4.31. Another model for the hypergeometric distribution is as follows: in an
urn are $N$ balls, $M$ of them are white, the remaining $N - M$ are red. Choose $n$ balls out
of the urn without replacing the chosen ones. Then $H_{N,M,n}(\{m\})$ is the probability of
observing $m$ white balls among the $n$ chosen.

If we do the same experiment, but now replacing the chosen balls, then this is
described by the binomial distribution. The success probability for a white ball is
$p = M/N$, hence now the probability for $m$ white balls is given by

\[ B_{n,M/N}(\{m\}) = \binom{n}{m} \left( \frac{M}{N} \right)^{m} \left( 1 - \frac{M}{N} \right)^{n-m} . \]

It is intuitively clear that for large $N$ and $M$ (and comparatively small $n$) the difference
between both models (replacing and nonreplacing) is insignificant. Imagine there are
$10^6$ white and also $10^6$ red balls in an urn. When choosing two balls it does not matter
much whether the first ball was replaced or not.


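The next proposition makes the previous observation more precise.

Proposition 1.4.32. If $0 \le m \le n$ and $0 \le p \le 1$, then

\[ \lim_{\substack{N, M \to \infty \\ M/N \to p}} H_{N,M,n}(\{m\}) = B_{n,p}(\{m\}) . \]

Proof: Suppose first $0 < p < 1$. Then the definition of the hypergeometric distribution
yields

\begin{align*}
\lim_{\substack{N, M \to \infty \\ M/N \to p}} H_{N,M,n}(\{m\})
  &= \lim_{\substack{N, M \to \infty \\ M/N \to p}}
     \frac{\dfrac{M \cdots (M-m+1)}{m!} \cdot \dfrac{(N-M) \cdots (N-M-(n-m)+1)}{(n-m)!}}
          {\dfrac{N (N-1) \cdots (N-n+1)}{n!}} \\
  &= \lim_{\substack{N, M \to \infty \\ M/N \to p}} \binom{n}{m}
     \frac{\left[ \frac{M}{N} \cdots \frac{M-m+1}{N} \right]
           \left[ \left( 1 - \frac{M}{N} \right) \cdots \left( 1 - \frac{M+(n-m)-1}{N} \right) \right]}
          {\left( 1 - \frac{1}{N} \right) \cdots \left( 1 - \frac{n-1}{N} \right)} \tag{1.27} \\
  &= \binom{n}{m} p^{m} (1-p)^{n-m} = B_{n,p}(\{m\}) .
\end{align*}

Note that if either $m = 0$ or $m = n$, then the first or the second bracket in eq. (1.27)
is an empty product, that is, it equals 1 and does not appear.

The cases $p = 0$ and $p = 1$ have to be treated separately. For example, if $p = 0$, the
fraction in eq. (1.27) converges to zero provided that $m \ge 1$. If $m = 0$, then

\[ \lim_{\substack{N, M \to \infty \\ M/N \to 0}} H_{N,M,n}(\{0\})
   = \lim_{\substack{N, M \to \infty \\ M/N \to 0}}
     \frac{\left( 1 - \frac{M}{N} \right) \cdots \left( 1 - \frac{M+n-1}{N} \right)}
          {\left( 1 - \frac{1}{N} \right) \cdots \left( 1 - \frac{n-1}{N} \right)}
   = 1 = B_{n,0}(\{0\}) . \]

The case $p = 1$ is treated similarly. Hence, the proposition is also valid in the boundary
cases. ∎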


Example 1.4.33. Suppose there are $N = 200$ balls in an urn, $M = 80$ of them are white.
Choosing $n = 10$ balls with or without replacement we get the following numerical
values. Note that $p = M/N = 2/5$.

m    H_{N,M,n}({m})    B_{n,p}({m})
1    0.0372601         0.0403108
2    0.118268          0.120932
3    0.217696          0.214991
4    0.257321          0.250823
5    0.204067          0.200658
6    0.10995           0.111477
7    0.0397376         0.0424673
8    0.00921879        0.0106168
9    0.0012395         0.00157286
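
The comparison in Example 1.4.33 and the limit statement of Proposition 1.4.32 can be illustrated numerically. In this sketch the helper names and the idea of multiplying $N$ and $M$ by a common factor `scale` (which keeps $p = M/N$ fixed) are assumptions made for illustration only.

```python
from math import comb


def hypergeom_pmf(N: int, M: int, n: int, m: int) -> float:
    """Sampling without replacement: m white balls among n drawn."""
    return comb(M, m) * comb(N - M, n - m) / comb(N, n)


def binom_pmf(n: int, p: float, m: int) -> float:
    """Sampling with replacement: m white balls among n drawn."""
    return comb(n, m) * p**m * (1 - p) ** (n - m)


N, M, n = 200, 80, 10
p = M / N

# Rows (m, without replacement, with replacement) as in Example 1.4.33.
rows = [(m, hypergeom_pmf(N, M, n, m), binom_pmf(n, p, m)) for m in range(n + 1)]
for m, h, b in rows:
    print(f"{m:2d}  {h:.6g}  {b:.6g}")


def max_gap(scale: int) -> float:
    """Largest difference between both models when N and M are multiplied by `scale`."""
    return max(
        abs(hypergeom_pmf(scale * N, scale * M, n, m) - binom_pmf(n, p, m))
        for m in range(n + 1)
    )


print(max_gap(1), max_gap(10), max_gap(100))
```

The last line prints the largest deviation for urns of growing size; the deviation shrinks as $N$ and $M$ grow with fixed ratio.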


1.4.7 Geometric Distribution 


At first glance the model for the geometric distribution looks like that for the binomial
distribution. In each single trial we may observe “0” or “1,” that is, failure or success.
Again the success probability is a fixed number $p$. While in the case of the binomial
distribution we executed a fixed number of trials, now this number is random. More
precisely, we execute the experiment until we observe success for the first time. What is
recorded is the number of trials needed until this first success shows up. In other
words, a number $k \ge 1$ occurs if and only if the first $k - 1$ trials were all failures and the
$k$th one was a success, that is, we observe the sequence $(\underbrace{0, \ldots, 0}_{k-1}, 1)$.
Since failure appears with probability $1 - p$ and success shows up with probability $p$,
the following approach is plausible:

\[ G_p(\{k\}) = p (1-p)^{k-1} , \qquad k \in \mathbb{N} . \tag{1.28} \]


Proposition 1.4.34. If $0 < p < 1$, then (1.28) defines a probability measure on $\mathcal{P}(\mathbb{N})$.

Proof: Because of $p (1-p)^{k-1} \ge 0$ it suffices to verify $\sum_{k=1}^{\infty} G_p(\{k\}) = 1$. Using the
formula for the sum of a geometric series this follows directly by

\[ \sum_{k=1}^{\infty} p (1-p)^{k-1} = p \sum_{k=0}^{\infty} (1-p)^{k} = p \, \frac{1}{1 - (1-p)} = 1 . \]

Observe that by assumption $0 < 1 - p < 1$, thus the formula for geometric series applies. ∎


Definition 1.4.35. The probability measure $G_p$ on $\mathcal{P}(\mathbb{N})$ defined by eq. (1.28) is
called the geometric distribution with parameter $p$.


If $p = 0$, then success will never show up, thus $G_p$ is not a probability measure. On
the other hand, for $p = 1$, success appears with probability one in the first trial, that is,
$G_p = \delta_1$. Therefore, this case is of no interest.


Example 1.4.36. Given a number $n \in \mathbb{N}$, let $A_n = \{k \in \mathbb{N} : k > n\}$. Find $G_p(A_n)$.

Answer: We answer this question by two different approaches.

First we remark that $A_n$ occurs if and only if the first occurrence of success shows
up strictly after $n$ trials or, equivalently, if and only if the first $n$ trials were all failures.
But this event has probability $B_{n,p}(\{0\}) = (1-p)^n$, hence $G_p(A_n) = (1-p)^n$.

In the second approach we use eq. (1.28) directly and obtain

\[ G_p(A_n) = \sum_{k=n+1}^{\infty} G_p(\{k\}) = p \sum_{k=n+1}^{\infty} (1-p)^{k-1}
   = p (1-p)^{n} \sum_{k=0}^{\infty} (1-p)^{k} = p (1-p)^{n} \, \frac{1}{1 - (1-p)} = (1-p)^{n} . \]
Example 1.4.37. Roll a die until number “6” occurs for the first time. What is the
probability that this happens in roll $k$?
Answer: The success probability is $1/6$, hence the probability of first occurrence of
“6” in the $k$th trial is $(1/6)(5/6)^{k-1}$.

k     Probability
1     0.166667
2     0.138889
3     0.115741
...
12    0.022431
13    0.018693


Example 1.4.38. Roll a die until the first “6” shows up. What is the probability that
this happens at an even number of trials?

Answer: The first “6” has to appear in the second or fourth or sixth, and so on,
trial. Hence, the probability of this event is

\[ \sum_{k=1}^{\infty} G_{1/6}(\{2k\}) = \frac{1}{6} \sum_{k=1}^{\infty} \left( \frac{5}{6} \right)^{2k-1}
   = \frac{5}{36} \sum_{k=1}^{\infty} \left( \frac{5}{6} \right)^{2k-2}
   = \frac{5}{36} \cdot \frac{1}{1 - (5/6)^2} = \frac{5}{11} . \]
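
The three die examples for the geometric distribution can be verified by direct summation. The sketch below truncates the infinite series at a large index, which is an assumption that is harmless here because the neglected terms are far below machine precision; all function names are hypothetical.

```python
def geom_pmf(p: float, k: int) -> float:
    """G_p({k}): first success in trial k."""
    return p * (1 - p) ** (k - 1)


def geom_tail(p: float, n: int) -> float:
    """G_p({k : k > n}) in closed form, as in Example 1.4.36."""
    return (1 - p) ** n


p = 1 / 6
CUTOFF = 2000  # truncation index for the infinite sums (assumption)

# Example 1.4.37: selected rows of the table.
table = {k: geom_pmf(p, k) for k in (1, 2, 3, 12, 13)}

# Example 1.4.36: tail probability by summation versus closed form.
n = 5
tail_by_sum = sum(geom_pmf(p, k) for k in range(n + 1, CUTOFF))

# Example 1.4.38: first "6" in an even-numbered roll.
even_roll = sum(geom_pmf(p, 2 * k) for k in range(1, CUTOFF // 2))

print(table)
print(tail_by_sum, geom_tail(p, n))
print(even_roll, 5 / 11)
```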


Example 1.4.39. Play a series of games where $p$ is the chance of winning. Whenever
you put $M$ dollars into the pool you get back $2M$ dollars if you win. If you lose, then
the $M$ dollars are lost.

Apply the following strategy. After losing, double the amount in the pool in the
next game. Say you start with 1 dollar and lose, then next time put 2 dollars into the pool,
then 4 dollars, and so on until you win for the first time. As is easily seen, in the $k$th game
the stake is $2^{k-1}$ dollars.

Suppose for some $k \ge 1$ you lost $k - 1$ games and won the $k$th one. How much
money did you lose? If $k = 1$, then you lost nothing, while for $k \ge 2$ you spent

\[ 1 + 2 + 4 + \cdots + 2^{k-2} = 2^{k-1} - 1 \]

dollars. Note that $2^{k-1} - 1 = 0$ if $k = 1$, hence for all $k \ge 1$ the total loss is $2^{k-1} - 1$ dollars.
On the other hand, if you win the $k$th game, you gain $2^{k-1}$ dollars. Consequently,
no matter what the results are, you will always win $2^{k-1} - (2^{k-1} - 1) = 1$ dollar.¹⁰
Let $X(k)$ be the amount of money needed in the case that one wins for the first time
in the $k$th game. One needs 1 dollar to play the first game, $1 + 2 = 3$ dollars to play the
second, until

\[ 1 + 2 + \cdots + 2^{k-1} = 2^{k} - 1 \]

to play the $k$th game. Thus, $X(k) = 2^{k} - 1$ and

\[ P\{X = 2^{k} - 1\} = P\{\text{First win in game } k\} = p (1-p)^{k-1} , \qquad k = 1, 2, \ldots \]

In particular, if $p = 1/2$ then this probability equals $2^{-k}$. For example, if one starts the
game with $127 = 2^{7} - 1$ dollars in the pocket, then one goes bankrupt if the first success
appears after game 7. The probability for this equals $\sum_{k=8}^{\infty} 2^{-k} = 2^{-7} = 0.0078125$.
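
The bookkeeping of the doubling strategy can be replayed in code. This is a small sketch with hypothetical function names; it assumes, as in the example, an initial stake of one dollar and a payout of twice the stake.

```python
def stake(k: int) -> int:
    """Stake in game k when doubling after every loss, starting with 1 dollar."""
    return 2 ** (k - 1)


def capital_needed(k: int) -> int:
    """X(k): money needed to be able to play games 1, ..., k."""
    return sum(stake(j) for j in range(1, k + 1))


def net_gain(k: int) -> int:
    """Gain of the winning game k minus the stakes lost in games 1, ..., k-1."""
    lost = sum(stake(j) for j in range(1, k))
    return stake(k) - lost


def ruin_probability(games_affordable: int, p: float = 0.5) -> float:
    """Probability that the first win comes only after the last affordable game."""
    return (1 - p) ** games_affordable


print([net_gain(k) for k in range(1, 8)])   # the net gain is the same for every k
print(capital_needed(7))                    # money needed to survive seven games
print(ruin_probability(7))                  # chance that seven games are not enough
```

With 127 dollars exactly seven games can be financed, so bankruptcy means seven losses in a row.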


1.4.8 Negative Binomial Distribution 


The geometric distribution describes the probability for having the first success in
trial $k$. Given a fixed $n \ge 1$ we now ask for the probability that in trial $k$ success appears
not for the first but for the $n$th time. Of course, this question makes sense only if $k \ge n$.
But how do we determine this probability for those $k$?


10 Starting the first game with x dollars one will always win x dollars no matter what happens.

Thus, take $k \ge n$ and suppose we had success in trial $k$. When is this the $n$th one?
This is the case if and only if we had $n - 1$ successes during the first $k - 1$ trials or,
equivalently, $k - n$ failures. There exist $\binom{k-1}{k-n}$ possibilities to distribute the $k - n$ failures
among the first $k - 1$ trials. Furthermore, the probability for $n$ successes is $p^{n}$ and
for $k - n$ failures it is $(1-p)^{k-n}$, hence the probability for observing the $n$th success in
trial $k$ is given by

\[ B^{-}_{n,p}(\{k\}) := \binom{k-1}{k-n} p^{n} (1-p)^{k-n} , \qquad k = n, n+1, \ldots \tag{1.29} \]

We still have to verify that there is a probability measure satisfying eq. (1.29).
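Proposition 1.4.40. By

\[ B^{-}_{n,p}(\{k\}) := \binom{k-1}{k-n} p^{n} (1-p)^{k-n} , \qquad k = n, n+1, \ldots , \]

a probability measure $B^{-}_{n,p}$ on $\mathcal{P}(\{n, n+1, \ldots\})$ is defined.


Proof: Of course, $B^{-}_{n,p}(\{k\}) \ge 0$. Hence it remains to show

\[ \sum_{k=n}^{\infty} B^{-}_{n,p}(\{k\}) = 1 \quad \text{or, equivalently,} \quad
   \sum_{k=0}^{\infty} B^{-}_{n,p}(\{k+n\}) = 1 . \tag{1.30} \]

Because of Lemma A.3.9 we get

\begin{align*}
B^{-}_{n,p}(\{k+n\}) &= \binom{n+k-1}{k} p^{n} (1-p)^{k} \\
  &= \binom{-n}{k} p^{n} (-1)^{k} (1-p)^{k} = \binom{-n}{k} p^{n} (p-1)^{k} , \tag{1.31}
\end{align*}

where the generalized binomial coefficient is defined in eq. (A.13) as

\[ \binom{-n}{k} := \frac{(-n)(-n-1) \cdots (-n-k+1)}{k!} . \]

In Proposition A.5.2 we proved for $|x| < 1$

\[ \sum_{k=0}^{\infty} \binom{-n}{k} x^{k} = \frac{1}{(1+x)^{n}} . \tag{1.32} \]

Note that $0 < p < 1$, hence eq. (1.32) applies with $x = p - 1$ and leads to

\[ \sum_{k=0}^{\infty} \binom{-n}{k} (p-1)^{k} = \frac{1}{p^{n}} . \tag{1.33} \]

Combining eqs. (1.31) and (1.33) implies

\[ \sum_{k=0}^{\infty} B^{-}_{n,p}(\{k+n\}) = p^{n} \sum_{k=0}^{\infty} \binom{-n}{k} (p-1)^{k}
   = p^{n} \, \frac{1}{p^{n}} = 1 , \]

thus the equations in (1.30) are valid and this completes the proof. ∎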
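Definition 1.4.41. The probability measure $B^{-}_{n,p}$ with

\[ B^{-}_{n,p}(\{k\}) := \binom{k-1}{k-n} p^{n} (1-p)^{k-n} = \binom{k-1}{n-1} p^{n} (1-p)^{k-n} ,
   \qquad k = n, n+1, \ldots \]

is called the negative binomial distribution with parameters $n \ge 1$ and $p \in (0, 1)$. Of
course, $B^{-}_{1,p} = G_p$.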


Remark 1.4.42. We saw in eq. (1.31) that

\[ B^{-}_{n,p}(\{k+n\}) = \binom{n+k-1}{k} p^{n} (1-p)^{k} = \binom{-n}{k} p^{n} (p-1)^{k} . \tag{1.34} \]

Alternatively one may define the negative binomial distribution also via eq. (1.34).
Then it describes the event that the $n$th success appears in trial $n + k$. The advantage
of this approach is that now $k \in \mathbb{N}_0$, that is, the restriction $k \ge n$ is no longer needed.
Its disadvantage is that we are interested in what happens in trial $k$, not in trial $k + n$.


Example 1.4.43. Roll a die successively. Determine the probability that in the 20th
trial number “6” appears for the fourth time.
Answer: We have $p = 1/6$, $n = 4$, and $k = 20$. Therefore, the probability for this
event is given by

\[ B^{-}_{4,1/6}(\{20\}) = \binom{19}{3} \left( \frac{1}{6} \right)^{4} \left( \frac{5}{6} \right)^{16} = 0.0404407 . \]

Let us now ask for the probability that the fourth success appears (strictly) before
trial 21. This probability is given by

\[ \sum_{k=4}^{20} \binom{k-1}{3} \left( \frac{1}{6} \right)^{4} \left( \frac{5}{6} \right)^{k-4} = 0.433454 . \]
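
Both numbers of Example 1.4.43 can be recomputed, and the second one can be cross-checked against the binomial distribution: the fourth “6” appears no later than trial 20 exactly when at least four of the first 20 rolls show a “6.” The following sketch uses hypothetical function names.

```python
from math import comb


def negbinom_pmf(n: int, p: float, k: int) -> float:
    """Probability that the n-th success appears in trial k (zero for k < n)."""
    if k < n:
        return 0.0
    return comb(k - 1, n - 1) * p**n * (1 - p) ** (k - n)


p = 1 / 6

# Fourth "6" exactly in roll 20.
p_exactly_20 = negbinom_pmf(4, p, 20)

# Fourth "6" in one of the rolls 4, ..., 20.
p_by_20 = sum(negbinom_pmf(4, p, k) for k in range(4, 21))

# Cross-check: at least four successes among 20 independent rolls.
p_at_least_4 = 1 - sum(comb(20, j) * p**j * (1 - p) ** (20 - j) for j in range(4))

print(f"{p_exactly_20:.7f}")
print(f"{p_by_20:.6f}  {p_at_least_4:.6f}")
```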
Example 1.4.44. There are two urns, say $U_0$ and $U_1$, each containing $N$ balls. Choose
one of the two urns at random and take out a ball. Here $U_0$ is chosen with probability
$1 - p$, hence $U_1$ with probability $p$. Repeat the procedure until we choose the last
(the $N$th) ball out of one of the urns. What is the probability that there are $m$ balls left
in the other urn, where $m = 1, \ldots, N$?¹¹

Answer: For $m = 1, \ldots, N$ let $A_m$ be the event that there are still $m$ balls in one of
the urns when choosing the last ball out of the other. Then $A_m$ splits into the disjoint
events $A_m = A_m^0 \cup A_m^1$, where

$A_m^0$ occurs if we take the last ball out of $U_0$ and $U_1$ contains $m$ balls, and

$A_m^1$ occurs if we take the $N$th ball out of $U_1$ and there are $m$ balls in $U_0$.

Let us start with evaluating the probability of $A_m^1$. Say success occurs if we choose
urn $U_1$. Thus, if we take out the last ball of urn $U_1$, then success occurred for the $N$th
time. On the other hand, if there are still $m$ balls in $U_0$, then failure had occurred $N - m$
times. Consequently, there are still $m$ balls left in urn $U_0$ if and only if the $N$th success
shows up in trial $N + (N - m) = 2N - m$. Therefore, we get

\[ P(A_m^1) = B^{-}_{N,p}(\{2N-m\}) = \binom{2N-m-1}{N-m} p^{N} (1-p)^{N-m} . \tag{1.35} \]

The probability of $A_m^0$ may be derived from that of $A_m^1$ by interchanging $p$ and $1 - p$
(success occurs now with probability $1 - p$). This yields

\[ P(A_m^0) = B^{-}_{N,1-p}(\{2N-m\}) = \binom{2N-m-1}{N-m} p^{N-m} (1-p)^{N} . \tag{1.36} \]

Adding eqs. (1.35) and (1.36) leads to

\[ P(A_m) = \binom{2N-m-1}{N-m} \left[ p^{N} (1-p)^{N-m} + p^{N-m} (1-p)^{N} \right] ,
   \qquad m = 1, \ldots, N . \]

If $p = 1/2$, that is, both urns are equally likely, the previous formula simplifies to

\[ P(A_m) = \binom{2N-m-1}{N-m} 2^{-2N+m+1} . \tag{1.37} \]

Remark 1.4.45. The case p = 1/2 in the previous problem is known as Banach’s match- 
box problem. In each of two matchboxes are N matches. One chooses randomly a 
matchbox (both boxes are equally likely) and takes out a match. What is the prob- 
ability that there are still m matches left in the other box when taking the last match 
out of one of the boxes? The answer is given by eq. (1.37). 
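
The distribution of the number of remaining matches can be tabulated directly. The sketch below assumes that the probability of $m$ remaining items is the binomial coefficient $\binom{2N-m-1}{N-m}$ times $p^N(1-p)^{N-m} + p^{N-m}(1-p)^N$, as derived in Example 1.4.44; the function name `matchbox_pmf` and the choice $N = 50$ are illustrative assumptions.

```python
from math import comb


def matchbox_pmf(N: int, m: int, p: float = 0.5) -> float:
    """Probability that m items are left in the other box (urn), m = 1, ..., N."""
    ways = comb(2 * N - m - 1, N - m)
    return ways * (p**N * (1 - p) ** (N - m) + p ** (N - m) * (1 - p) ** N)


N = 50
dist = [matchbox_pmf(N, m) for m in range(1, N + 1)]

print(f"total mass: {sum(dist):.12f}")
for m in (1, 2, 5, 10, 20):
    print(m, f"{dist[m - 1]:.6f}")
```

Since one of the boxes must deliver its last match at some point, the probabilities for $m = 1, \ldots, N$ add up to one; the printed total mass reflects this.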


11 There exists a slightly different version of this problem: What is the probability that there are m
balls left in one of the urns when an empty urn is chosen for the first time? Note that in this setting
m = 0 may also occur.

Example 1.4.46. We continue Example 1.4.44 with $0 < p < 1$ and ask the following
question: What is the probability that $U_1$ becomes empty before $U_0$?

Answer: This happens if and only if $U_0$ is nonempty when choosing $U_1$ for the $N$th
time, that is, when in $U_0$ there are $m$ balls left for some $m = 1, \ldots, N$. Because of
eq. (1.35), this probability is given by

\[ \sum_{m=1}^{N} P(A_m^1) = p^{N} \sum_{m=1}^{N} \binom{2N-m-1}{N-m} (1-p)^{N-m}
   = p^{N} \sum_{k=0}^{N-1} \binom{N+k-1}{k} (1-p)^{k} . \tag{1.38} \]

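Remark 1.4.47. Formula (1.38) leads to an interesting (known) property of the binomial
coefficients. Since $\sum_{m=1}^{N} P(A_m) = 1$, by eqs. (1.38) and (1.36) we obtain

\[ \sum_{k=0}^{N-1} \binom{N+k-1}{k} \left[ p^{N} (1-p)^{k} + (1-p)^{N} p^{k} \right] = 1 \]

or, setting $n = N - 1$,

\[ \sum_{k=0}^{n} \binom{n+k}{k} \left[ p^{n+1} (1-p)^{k} + (1-p)^{n+1} p^{k} \right] = 1 . \]

In particular, if $p = 1/2$, this yields

\[ \sum_{k=0}^{n} \binom{n+k}{k} \frac{1}{2^{k}} = 2^{n} \]

for $n = 0, 1, \ldots$.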


1.5 Continuous Probability Measures 


Discrete probability measures are inappropriate for the description of random
experiments where uncountably many different results may appear. Typical examples
of such experiments are the lifetime of an item, the duration of a phone call, the
measurement of a workpiece, and so on.

Discrete probability measures are concentrated on a finite or countably infinite
set of points. An extension to larger sets is impossible. For example, there is no¹²
probability measure $P$ on $[0, 1]$ with $P(\{t\}) > 0$ for all $t \in [0, 1]$.


12 Compare Problem 1.31.

Consequently, in order to describe random experiments with “many” possible
different outcomes another approach is needed. To explain this “new” approach let
us briefly recall how we evaluated $P(A)$ in the discrete case. If $\Omega$ is either finite or
countably infinite and if $p(\omega) = P(\{\omega\})$, $\omega \in \Omega$, then with this $p : \Omega \to \mathbb{R}$ we have

\[ P(A) = \sum_{\omega \in A} p(\omega) . \tag{1.39} \]

If the sample space is $\mathbb{R}$ or $\mathbb{R}^n$, then such a representation is no longer possible.
Indeed, if $P$ is not discrete, then we will have $p(\omega) = 0$ for all possible observations $\omega$.
Therefore, the sum in eq. (1.39) has to be replaced by an integral over a more general
function. We start with introducing functions $p$, which may be used for representing
$P(A)$ via an integral.


Definition 1.5.1. A Riemann integrable function $p : \mathbb{R} \to \mathbb{R}$ is called probability
density function or simply density function if

\[ p(x) \ge 0 , \quad x \in \mathbb{R} , \qquad \text{and} \qquad \int_{-\infty}^{\infty} p(x) \, dx = 1 . \tag{1.40} \]


Remark 1.5.2. Let us formulate the second condition about $p$ in the previous definition
more precisely. For all finite intervals $[a, b]$ in $\mathbb{R}$ the function $p$ is Riemann integrable
on $[a, b]$ and, moreover,

\[ \lim_{\substack{a \to -\infty \\ b \to \infty}} \int_{a}^{b} p(x) \, dx = 1 . \]

The density functions we will use later on are either continuous or piecewise
continuous, that is, they are pieced together from finitely many continuous functions. These
functions are Riemann integrable, hence in this case it remains to verify the two
conditions (1.40).


Example 1.5.3. Define $p$ on $\mathbb{R}$ by $p(x) = 0$ if $x < 0$ and by $p(x) = e^{-x}$ if $x \ge 0$. Then $p$ is
piecewise continuous, $p(x) \ge 0$ if $x \in \mathbb{R}$ and satisfies

\[ \int_{-\infty}^{\infty} p(x) \, dx = \lim_{b \to \infty} \int_{0}^{b} e^{-x} \, dx
   = \lim_{b \to \infty} \left[ -e^{-x} \right]_{0}^{b} = 1 - \lim_{b \to \infty} e^{-b} = 1 . \]

Hence, $p$ is a density function.


Definition 1.5.4. Let $p$ be a probability density function. Given a finite interval
$[a, b]$, its probability (of occurrence) is defined by

\[ P([a, b]) := \int_{a}^{b} p(x) \, dx . \]

A graphical presentation of the previous definition is as follows. The probability
$P([a, b])$ is the area under the graph of the density $p$, taken from $a$ to $b$.


Figure 1.1: Probability of the interval [a, b].


Let us illustrate Definition 1.5.4 by the density function regarded in Example 1.5.3.
Then

\[ P([a, b]) = \int_{a}^{b} e^{-x} \, dx = \left[ -e^{-x} \right]_{a}^{b} = e^{-a} - e^{-b} \]

whenever $0 \le a < b < \infty$. On the other hand, if $a < b < 0$, then $P([a, b]) = 0$ while for
$a < 0 \le b$ the probability of $[a, b]$ is calculated by

\[ P([a, b]) = P([0, b]) = 1 - e^{-b} . \]
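
The interpretation of $P([a, b])$ as an area can be made concrete by approximating the integral numerically. The sketch below uses a simple midpoint rule, which is an illustrative choice and not a method taken from the text; the function names and the number of subintervals are assumptions.

```python
from math import exp


def density(x: float) -> float:
    """Density of Example 1.5.3: zero on the negative half-line, exp(-x) otherwise."""
    return exp(-x) if x >= 0 else 0.0


def prob_interval(a: float, b: float, steps: int = 100_000) -> float:
    """Midpoint-rule approximation of the area under the density between a and b."""
    h = (b - a) / steps
    return h * sum(density(a + (i + 0.5) * h) for i in range(steps))


# Interval inside [0, oo): compare with exp(-a) - exp(-b).
print(prob_interval(0.5, 1.5), exp(-0.5) - exp(-1.5))

# Interval reaching into the negative half-line: compare with 1 - exp(-b).
print(prob_interval(-1.0, 2.0), 1 - exp(-2.0))

# Interval entirely to the left of zero.
print(prob_interval(-2.0, -1.0))
```

The second comparison is slightly less accurate because the density jumps at zero, and a single subinterval of the grid straddles the jump.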


Remark 1.5.5. Definition 1.5.4 of the probability measure $P$ does not fit into the scheme
presented in Section 1.1.3. Why? Probability measures are defined on σ-fields. But the
collection of finite intervals in $\mathbb{R}$ is not a σ-field. It is neither closed under taking
complements nor is the union of intervals in general again an interval. Furthermore, it is
far from being clear in which sense $P$ should be σ-additive.


The next result justifies the approach in Definition 1.5.4. Its proof rests upon an
extension theorem in Measure Theory (cf. [Coh13] or [Dud02]).

Proposition 1.5.6. Let $\mathcal{B}(\mathbb{R})$ be the σ-field of Borel sets introduced in Definition 1.1.15.
Then for each density function $p$ there exists a unique probability measure
$P : \mathcal{B}(\mathbb{R}) \to [0, 1]$ such that

\[ P([a, b]) = \int_{a}^{b} p(x) \, dx \qquad \text{for all } a < b . \tag{1.41} \]


Definition 1.5.7. A probability measure $P$ on $\mathcal{B}(\mathbb{R})$ is said to be continuous¹³
provided that there exists a density function $p$ such that for $a < b$

\[ P([a, b]) = \int_{a}^{b} p(x) \, dx . \tag{1.42} \]

The function $p$ is called density function or simply density of $P$.


Remark 1.5.8. Note that changing the density function at finitely many points does
not change the generated probability measure. For instance, if we define $p(x) = 0$ if
$x \le 0$ and $p(x) = e^{-x}$ if $x > 0$, then this density function is different from that in
Example 1.5.3 but, of course, generates the same probability measure.

Moreover, observe that eq. (1.42) is valid for all $a < b$ if and only if for each $t \in \mathbb{R}$

\[ P((-\infty, t]) = \int_{-\infty}^{t} p(x) \, dx . \tag{1.43} \]

Consequently, $P$ is continuous if and only if there is a density $p$ with eq. (1.43) for $t \in \mathbb{R}$.


Proposition 1.5.9. Let $P : \mathcal{B}(\mathbb{R}) \to [0, 1]$ be a continuous probability measure with
density $p$. Then the following are valid:

1. $P(\mathbb{R}) = 1$.

2. For each $t \in \mathbb{R}$ we have $P(\{t\}) = 0$. More generally, if $A \subset \mathbb{R}$ is either finite or countably
infinite, then $P(A) = 0$.

3. For all $a < b$ we have

\[ P((a, b)) = P((a, b]) = P([a, b)) = P([a, b]) = \int_{a}^{b} p(x) \, dx . \]


Proof: Let us start with proving $P(\mathbb{R}) = 1$. For $n \ge 1$ set $A_n := [-n, n]$ and note that
$A_1 \subseteq A_2 \subseteq \cdots$ as well as $\bigcup_{n=1}^{\infty} A_n = \mathbb{R}$. Thus we may use that $P$ is continuous from
below and by the properties of the density $p$ we obtain

\[ P(\mathbb{R}) = \lim_{n \to \infty} P(A_n) = \lim_{n \to \infty} \int_{-n}^{n} p(x) \, dx
   = \int_{-\infty}^{\infty} p(x) \, dx = 1 . \]

To verify the second property fix $t \in \mathbb{R}$ and define for $n \ge 1$ intervals $B_n$ by
$B_n := [t, t + \frac{1}{n}]$. Now we have $B_1 \supseteq B_2 \supseteq \cdots$ and $\bigcap_{n=1}^{\infty} B_n = \{t\}$. This time we use the
continuity from above. Then we get

\[ P(\{t\}) = \lim_{n \to \infty} P(B_n) = \lim_{n \to \infty} \int_{t}^{t + \frac{1}{n}} p(x) \, dx = 0 . \]

If $A = \{t_1, t_2, \ldots\}$, then the σ-additivity of $P$ together with $P(\{t_j\}) = 0$ gives

\[ P(A) = \sum_{j=1}^{\infty} P(\{t_j\}) = 0 \]

as asserted.
The third property is an immediate consequence of the second one. Observe

\[ [a, b] = (a, b) \cup \{a\} \cup \{b\} , \]

hence $P([a, b]) = P((a, b)) + P(\{a\}) + P(\{b\})$. Because of $P(\{a\}) = P(\{b\}) = 0$ and eq. (1.42),
this proves the assertion for $(a, b)$; the half-open intervals are treated in the same way. ∎


13 The mathematically correct notation would be “absolutely continuous.” But since we do not treat
so-called “singularly continuous” probability measures, there is no need to distinguish between them,
and we may shorten the notation to “continuous.”


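Remark 1.5.10. Say a set $C \subseteq \mathbb{R}$ can be represented as $C = \bigcup_{j=1}^{\infty} I_j$ with disjoint (open
or half-open or closed) intervals $I_j$, then

\[ P(C) = \sum_{j=1}^{\infty} \int_{I_j} p(x) \, dx = \int_{C} p(x) \, dx . \]

More generally, if a set $B$ may be written as $B = \bigcap_{n=1}^{\infty} C_n$, where the $C_n$ are unions of
disjoint intervals and satisfy $C_1 \supseteq C_2 \supseteq \cdots$, then

\[ P(B) = \lim_{n \to \infty} P(C_n) . \]

In this way, one may evaluate $P(B)$ for a large class of subsets $B \subseteq \mathbb{R}$.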


1.6 Special Continuous Distributions 
1.6.1 Uniform Distribution on an Interval 


Let $I = [\alpha, \beta]$ be a finite interval of real numbers. Define a function $p : \mathbb{R} \to \mathbb{R}$ by

\[ p(x) := \begin{cases} \dfrac{1}{\beta - \alpha} & : x \in [\alpha, \beta] \\[1ex] 0 & : x \notin [\alpha, \beta] \end{cases} \tag{1.44} \]

Proposition 1.6.1. The mapping $p$ defined by eq. (1.44) is a probability density function.


Proof: Note that $p$ is piecewise continuous, hence Riemann integrable. Moreover,
$p(x) \ge 0$ for $x \in \mathbb{R}$ and

\[ \int_{-\infty}^{\infty} p(x) \, dx = \int_{\alpha}^{\beta} \frac{1}{\beta - \alpha} \, dx
   = \frac{1}{\beta - \alpha} (\beta - \alpha) = 1 . \]

Consequently, $p$ is a probability density. ∎


Definition 1.6.2. The probability measure $P$ generated by the density in eq. (1.44)
is called uniform distribution on the interval $I = [\alpha, \beta]$.


How is $P([a, b])$ evaluated for some interval $[a, b]$? Let us first treat the case $[a, b] \subseteq I$.
Then

\[ P([a, b]) = \int_{a}^{b} \frac{1}{\beta - \alpha} \, dx = \frac{b - a}{\beta - \alpha}
   = \frac{\text{Length of } [a, b]}{\text{Length of } [\alpha, \beta]} . \tag{1.45} \]

This explains why $P$ is called “uniform distribution.” The probability of an interval
$[a, b] \subseteq I$ depends only on its length, not on its position. Shifting $[a, b]$ inside $I$ does
not change its probability of occurrence.

If $[a, b]$ is arbitrary, not necessarily contained in $I$, then $P([a, b])$ can be easily
calculated by

\[ P([a, b]) = P([a, b] \cap I) . \]


Example 1.6.3. Let $P$ be the uniform distribution on $[0, 1]$. Which probabilities do
$[-1, 0.5]$, $[0, 0.25] \cup [0.75, 1]$, $(-\infty, t]$ for $t \in \mathbb{R}$, and $A \subseteq \mathbb{R}$ have, where
$A = \bigcup_{n=1}^{\infty} \left[ \frac{1}{2^{n+1/2}}, \frac{1}{2^{n}} \right]$?
Answer: The first two sets have probability $\frac{1}{2}$. If $t \in \mathbb{R}$, then

\[ P((-\infty, t]) = \begin{cases} 0 & : t < 0 \\ t & : 0 \le t \le 1 \\ 1 & : t > 1 \end{cases} \]

Finally, observe that the intervals $\left[ \frac{1}{2^{n+1/2}}, \frac{1}{2^{n}} \right]$ are disjoint subsets of $[0, 1]$. Hence we get

\[ P(A) = \sum_{n=1}^{\infty} \left[ \frac{1}{2^{n}} - \frac{1}{2^{n+1/2}} \right]
   = \left( 1 - 2^{-1/2} \right) \sum_{n=1}^{\infty} \frac{1}{2^{n}} = 1 - 2^{-1/2} . \]

Example 1.6.4. A stick of length $L$ is randomly broken into two pieces. Find the
probability that the size of one piece is at least twice that of the other one.

Answer: This event happens if and only if the point at which the stick is broken
is either in $[0, L/3]$ or in $[2L/3, L]$. Assuming that the point at which the stick is broken
is uniformly distributed on $[0, L]$, the desired probability is $\frac{2}{3}$. Another way to get this
result is as follows. The size of each piece is less than twice that of the other one
if the point at which the stick is broken is in $[L/3, 2L/3]$. Hence, the probability of the
complementary event is $1/3$, leading again to $2/3$ for the desired probability.


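Example 1.6.5. Let $C_0 := [0, 1]$. Extract from $C_0$ the interval $(\frac{1}{3}, \frac{2}{3})$, thus it remains
$C_1 = [0, \frac{1}{3}] \cup [\frac{2}{3}, 1]$. To construct $C_2$ extract from $C_1$ the two middle intervals $(\frac{1}{9}, \frac{2}{9})$ and
$(\frac{7}{9}, \frac{8}{9})$, hence $C_2 = [0, \frac{1}{9}] \cup [\frac{2}{9}, \frac{1}{3}] \cup [\frac{2}{3}, \frac{7}{9}] \cup [\frac{8}{9}, 1]$.

Suppose that through this method we already got sets $C_n$ which are the union of
$2^{n}$ disjoint closed intervals of length $3^{-n}$. In order to construct $C_{n+1}$, split each of the $2^{n}$
intervals into three intervals of length $3^{-n-1}$ and erase the middle one of these three.
In this way we get $C_{n+1}$, which consists of $2^{n+1}$ disjoint intervals of length $3^{-n-1}$. Finally,
one defines

\[ C := \bigcap_{n=1}^{\infty} C_n . \]

The set $C$ is known as the Cantor set. Let $P$ be the uniform distribution on $[0, 1]$. Which
value does $P(C)$ have?

Answer: First observe that $C_0 \supset C_1 \supset C_2 \supset \cdots$, hence, using that $P$ is continuous
from above, it follows

\[ P(C) = \lim_{n \to \infty} P(C_n) . \tag{1.46} \]

The sets $C_n$ are the disjoint union of $2^{n}$ intervals of length $3^{-n}$. Consequently, it follows
$P(C_n) = \frac{2^{n}}{3^{n}}$, which by eq. (1.46) implies $P(C) = 0$.
One might conjecture that $C = \emptyset$. On the contrary, $C$ is even uncountably infinite.
To see this we have to make the construction of the Cantor set a little bit more precise.
Given $n \ge 1$ let

\[ A_n = \{ \alpha = (\alpha_1, \ldots, \alpha_n) : \alpha_1, \ldots, \alpha_{n-1} \in \{0, 2\}, \ \alpha_n = 1 \} . \]

If $\alpha = (\alpha_1, \ldots, \alpha_n) \in A_n$, set $x_\alpha = \sum_{k=1}^{n} \frac{\alpha_k}{3^{k}}$ and $I_\alpha = \left( x_\alpha, x_\alpha + \frac{1}{3^{n}} \right)$. In this notation

\[ I_{(1)} = \left( \tfrac{1}{3}, \tfrac{2}{3} \right) , \quad I_{(0,1)} = \left( \tfrac{1}{9}, \tfrac{2}{9} \right) , \quad
   I_{(2,1)} = \left( \tfrac{7}{9}, \tfrac{8}{9} \right) \quad \text{and} \quad I_{(0,0,1)} = \left( \tfrac{1}{27}, \tfrac{2}{27} \right) . \]

Then, if $C_0 = [0, 1]$, for $n \ge 1$ we have

\[ C_n = C_{n-1} \setminus \bigcup_{\alpha \in A_n} I_\alpha , \qquad \text{hence} \qquad
   C = [0, 1] \setminus \bigcup_{n=1}^{\infty} \bigcup_{\alpha \in A_n} I_\alpha . \]

Take now any sequence $x_1, x_2, \ldots$ with $x_n \in \{0, 2\}$ and set $x = \sum_{n=1}^{\infty} \frac{x_n}{3^{n}}$. Then $x$
cannot belong to any $I_\alpha$ because otherwise at least one of the $x_k$ would have to satisfy $x_k = 1$.
Thus $x \in C$, and the number of $x$ that may be represented by $x_k$ with $x_k \in \{0, 2\}$ is
uncountably infinite.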
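
The first steps of the Cantor construction can be carried out exactly with rational arithmetic. The following sketch builds the intervals of the $n$th approximating set and adds up their lengths, which under the uniform distribution on $[0, 1]$ equals the probability of that set; the function names are hypothetical.

```python
from fractions import Fraction


def cantor_step(intervals):
    """Remove the open middle third from each closed interval (a, b) in the list."""
    result = []
    for a, b in intervals:
        third = (b - a) / 3
        result.append((a, a + third))
        result.append((b - third, b))
    return result


def cantor_level(n: int):
    """Closed intervals whose union is the n-th set of the construction."""
    intervals = [(Fraction(0), Fraction(1))]
    for _ in range(n):
        intervals = cantor_step(intervals)
    return intervals


def total_length(intervals) -> Fraction:
    """Sum of the interval lengths, i.e. the uniform probability of their union."""
    return sum(b - a for a, b in intervals)


print(cantor_level(2))
for n in range(0, 6):
    print(n, len(cantor_level(n)), total_length(cantor_level(n)))
```

The printed lengths are the powers of $2/3$, which tend to zero, in line with $P(C) = 0$.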


1.6.2 Normal Distribution 


This section is devoted to the most important probability measure, the normal 
distribution. Before we can introduce it, we need the following result. 


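Proposition 1.6.6. We have

\[ \int_{-\infty}^{\infty} e^{-x^2/2} \, dx = \sqrt{2\pi} . \]


Proof: Set

\[ a := \int_{-\infty}^{\infty} e^{-x^2/2} \, dx \]

and note that $a > 0$. Then we get

\[ a^2 = \left( \int_{-\infty}^{\infty} e^{-x^2/2} \, dx \right) \left( \int_{-\infty}^{\infty} e^{-y^2/2} \, dy \right)
   = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} e^{-(x^2+y^2)/2} \, dx \, dy . \]

Change the variables in the right-hand double integral as follows: $x := r \cos\theta$ and
$y := r \sin\theta$, where $0 < r < \infty$ and $0 \le \theta < 2\pi$. Observe that

\[ dx \, dy = |D(r, \theta)| \, dr \, d\theta \]

with

\[ D(r, \theta) = \det \begin{pmatrix} \frac{\partial x}{\partial r} & \frac{\partial x}{\partial \theta} \\[0.5ex]
   \frac{\partial y}{\partial r} & \frac{\partial y}{\partial \theta} \end{pmatrix}
   = \det \begin{pmatrix} \cos\theta & -r \sin\theta \\ \sin\theta & r \cos\theta \end{pmatrix}
   = r \cos^2\theta + r \sin^2\theta = r . \]

Using $x^2 + y^2 = r^2 \cos^2\theta + r^2 \sin^2\theta = r^2$, this change of variables leads to

\[ a^2 = \int_{0}^{2\pi} \int_{0}^{\infty} r e^{-r^2/2} \, dr \, d\theta
   = \int_{0}^{2\pi} \left[ -e^{-r^2/2} \right]_{0}^{\infty} d\theta = 2\pi , \]

which by $a > 0$ implies $a = \sqrt{2\pi}$. This completes the proof. ∎

Given $\mu \in \mathbb{R}$ and $\sigma > 0$ let

\[ p_{\mu,\sigma}(x) := \frac{1}{\sqrt{2\pi} \, \sigma} \, e^{-(x-\mu)^2 / 2\sigma^2} , \qquad x \in \mathbb{R} . \tag{1.47} \]

Figure 1.2: The function $p_{\mu,\sigma}$ with parameters $\mu = 0, 1, 2, 3$ and $\sigma = 0.5, 0.75, 0.9, 1.1$.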


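Proposition 1.6.7. If $\mu \in \mathbb{R}$ and $\sigma > 0$, then $p_{\mu,\sigma}$ is a probability density function.


Proof: We have to verify

\[ \int_{-\infty}^{\infty} p_{\mu,\sigma}(x) \, dx = 1 \qquad \text{or} \qquad
   \int_{-\infty}^{\infty} e^{-(x-\mu)^2 / 2\sigma^2} \, dx = \sqrt{2\pi} \, \sigma . \]

Setting $u := (x - \mu)/\sigma$ it follows $dx = \sigma \, du$, hence Proposition 1.6.6 leads to

\[ \int_{-\infty}^{\infty} e^{-(x-\mu)^2 / 2\sigma^2} \, dx = \sigma \int_{-\infty}^{\infty} e^{-u^2/2} \, du = \sigma \sqrt{2\pi} . \]

This completes the proof. ∎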


Definition 1.6.8. The probability measure generated by $p_{\mu,\sigma}$ is called normal
distribution with expected value $\mu$ and variance $\sigma^2$. It is denoted by $\mathcal{N}(\mu, \sigma^2)$, that is,
for all $a < b$

\[ \mathcal{N}(\mu, \sigma^2)([a, b]) = \frac{1}{\sqrt{2\pi} \, \sigma} \int_{a}^{b} e^{-(x-\mu)^2 / 2\sigma^2} \, dx . \]


Remark 1.6.9. At the moment the numbers $\mu \in \mathbb{R}$ and $\sigma > 0$ are nothing more than
parameters. Why they are called “expected value” and “variance” will become clear
in Section 5.

Definition 1.6.10. The probability measure $\mathcal{N}(0, 1)$ is called standard normal
distribution. It is given by

\[ \mathcal{N}(0, 1)([a, b]) = \frac{1}{\sqrt{2\pi}} \int_{a}^{b} e^{-x^2/2} \, dx . \]


Example 1.6.11. For example, we have

\[ \mathcal{N}(0, 1)([-1, 1]) = \frac{1}{\sqrt{2\pi}} \int_{-1}^{1} e^{-x^2/2} \, dx = 0.682689 \qquad \text{or} \]

\[ \mathcal{N}(0, 1)([2, 4]) = \frac{1}{\sqrt{2\pi}} \int_{2}^{4} e^{-x^2/2} \, dx = 0.0227185 . \]
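
The two values of Example 1.6.11 can be recomputed without numerical integration. The sketch below relies on the error function `math.erf` from the Python standard library and on the standard identity that the standard normal distribution function equals $\frac{1}{2}\bigl(1 + \operatorname{erf}(t/\sqrt{2})\bigr)$; this identity is not stated in the text and is used here as an assumption, and the function names are hypothetical.

```python
from math import erf, sqrt


def normal_cdf(t: float, mu: float = 0.0, sigma: float = 1.0) -> float:
    """N(mu, sigma^2)((-oo, t]) expressed through the error function."""
    return 0.5 * (1.0 + erf((t - mu) / (sigma * sqrt(2.0))))


def normal_prob(a: float, b: float, mu: float = 0.0, sigma: float = 1.0) -> float:
    """N(mu, sigma^2)([a, b]) for a < b."""
    return normal_cdf(b, mu, sigma) - normal_cdf(a, mu, sigma)


print(f"{normal_prob(-1.0, 1.0):.6f}")
print(f"{normal_prob(2.0, 4.0):.7f}")

# Shifting and scaling: [mu - sigma, mu + sigma] has the same probability as [-1, 1].
print(f"{normal_prob(1.0, 5.0, mu=3.0, sigma=2.0):.6f}")
```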


1.6.3 Gamma Distribution 
Euler's gamma function is a mapping from $(0, \infty)$ to $\mathbb{R}$ defined by

\[ \Gamma(x) := \int_{0}^{\infty} s^{x-1} e^{-s} \, ds , \qquad x > 0 . \]

Figure 1.3: Graph of the gamma function.


Let us summarize the main properties of the gamma function. 


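Proposition 1.6.12.

1. $\Gamma$ maps $(0, \infty)$ continuously to $(0, \infty)$ and possesses continuous derivatives of any
order.

2. If $x > 0$, then

\[ \Gamma(x + 1) = x \, \Gamma(x) . \tag{1.48} \]

3. For $n \in \mathbb{N}$ we have $\Gamma(n) = (n - 1)!$. In particular, $\Gamma(1) = \Gamma(2) = 1$ and $\Gamma(3) = 2$.

4. $\Gamma(1/2) = \sqrt{\pi}$.


Proof: For the proof of the continuity and differentiability we refer to [Art64].
The proof of eq. (1.48) is carried out by integration by parts as follows:

\[ \Gamma(x + 1) = \int_{0}^{\infty} s^{x} e^{-s} \, ds
   = \left[ -s^{x} e^{-s} \right]_{0}^{\infty} + \int_{0}^{\infty} x s^{x-1} e^{-s} \, ds = x \, \Gamma(x) . \]

Note that $s^{x} e^{-s} = 0$ if $s = 0$ and that $s^{x} e^{-s} \to 0$ as $s \to \infty$.
From

\[ \Gamma(1) = \int_{0}^{\infty} e^{-s} \, ds = 1 \]

and eq. (1.48) follows, as claimed,

\[ \Gamma(n) = (n-1) \Gamma(n-1) = (n-1)(n-2) \Gamma(n-2) = \cdots = (n-1) \cdots 1 \cdot \Gamma(1) = (n-1)! . \]

To prove the fourth assertion we use Proposition 1.6.6. Because of

\[ \sqrt{2\pi} = \int_{-\infty}^{\infty} e^{-t^2/2} \, dt = 2 \int_{0}^{\infty} e^{-t^2/2} \, dt \]

it follows that

\[ \int_{0}^{\infty} e^{-t^2/2} \, dt = \sqrt{\frac{\pi}{2}} . \tag{1.49} \]

Substituting $s = t^2/2$, thus $ds = t \, dt$, by eq. (1.49) the integral for $\Gamma(1/2)$ transforms to

\[ \Gamma(1/2) = \int_{0}^{\infty} s^{-1/2} e^{-s} \, ds = \int_{0}^{\infty} \frac{\sqrt{2}}{t} \, e^{-t^2/2} \, t \, dt
   = \sqrt{2} \int_{0}^{\infty} e^{-t^2/2} \, dt = \sqrt{\pi} . \]

This completes the proof. ∎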


If $x \to \infty$, then $\Gamma(x)$ increases very rapidly. More precisely, the following is valid
(cf. [Art64]):


Proposition 1.6.13 (Stirling's formula for the Γ-function). For $x > 0$ there exists a
number $\theta \in (0, 1)$ such that

\[ \Gamma(x) = \sqrt{\frac{2\pi}{x}} \left( \frac{x}{e} \right)^{x} e^{\theta / 12x} . \tag{1.50} \]

Corollary 1.6.14 (Stirling's formula for n-factorial). In view of

\[ n! = \Gamma(n + 1) = n \, \Gamma(n) \]

formula (1.50) leads to¹⁴

\[ n! = \sqrt{2\pi n} \left( \frac{n}{e} \right)^{n} e^{\theta / 12n} \tag{1.51} \]

for some $\theta \in (0, 1)$ depending on $n$. In particular,

\[ \lim_{n \to \infty} \frac{e^{n}}{n^{n + 1/2}} \, n! = \sqrt{2\pi} . \]

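Our next aim is to introduce a continuous probability measure with density tightly
related to the Γ-function. Given $\alpha, \beta > 0$ define $p_{\alpha,\beta}$ from $\mathbb{R}$ to $\mathbb{R}$ by

\[ p_{\alpha,\beta}(x) := \begin{cases} 0 & : x \le 0 \\[0.5ex]
   \dfrac{1}{\alpha^{\beta} \Gamma(\beta)} \, x^{\beta - 1} e^{-x/\alpha} & : x > 0 \end{cases} \tag{1.52} \]

Figure 1.4: The functions $p_{1,\beta}$ with $\beta = 0.5, 1, 1.5, 2$ and $\beta = 2.5$.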


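Proposition 1.6.15. For all $\alpha, \beta > 0$ the function $p_{\alpha,\beta}$ in eq. (1.52) is a probability density.


Proof: Of course, $p_{\alpha,\beta}(x) \ge 0$. Thus it remains to verify

\[ \int_{-\infty}^{\infty} p_{\alpha,\beta}(x) \, dx = 1 . \tag{1.53} \]

By the definition of $p_{\alpha,\beta}$ we have

\[ \int_{-\infty}^{\infty} p_{\alpha,\beta}(x) \, dx = \frac{1}{\alpha^{\beta} \Gamma(\beta)} \int_{0}^{\infty} x^{\beta - 1} e^{-x/\alpha} \, dx . \]

Substituting in the right-hand integral $u := x/\alpha$, thus $dx = \alpha \, du$, the right-hand side
becomes

\[ \frac{1}{\Gamma(\beta)} \int_{0}^{\infty} u^{\beta - 1} e^{-u} \, du = \frac{1}{\Gamma(\beta)} \, \Gamma(\beta) = 1 . \]

Hence eq. (1.53) is valid, and $p_{\alpha,\beta}$ is a probability density function. ∎


14 cf. also [Spi08], Chapter 27, Problem 19.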


Definition 1.6.16. The probability measure $\Gamma_{\alpha,\beta}$ with density function $p_{\alpha,\beta}$ is called
gamma distribution with parameters $\alpha$ and $\beta$. For all $0 \le a < b < \infty$

\[ \Gamma_{\alpha,\beta}([a, b]) = \frac{1}{\alpha^{\beta} \Gamma(\beta)} \int_{a}^{b} x^{\beta - 1} e^{-x/\alpha} \, dx . \tag{1.54} \]


Remark 1.6.17. Since $p_{\alpha,\beta}(x) = 0$ for $x \le 0$ it follows that $\Gamma_{\alpha,\beta}((-\infty, 0]) = 0$. Hence, if
$a < b$ are arbitrary, then

\[ \Gamma_{\alpha,\beta}([a, b]) = \Gamma_{\alpha,\beta}([0, \infty) \cap [a, b]) . \]

Remark 1.6.18. If $\beta \notin \mathbb{N}$, then the integral in eq. (1.54) cannot be expressed by
elementary functions. Only numerical evaluations are possible.


1.6.4 Exponential Distribution 


An important special gamma distribution is the exponential distribution. This prob- 
ability measure is defined as follows. 


Definition 1.6.19. For $\lambda > 0$ let $E_\lambda := \Gamma_{\lambda^{-1},1}$ be the exponential distribution with
parameter $\lambda > 0$.


Remark 1.6.20. The probability density function $p_\lambda$ of $E_\lambda$ is given by

\[ p_\lambda(x) = \begin{cases} 0 & : x \le 0 \\ \lambda e^{-\lambda x} & : x > 0 \end{cases} \]

Consequently, if $0 \le a < b < \infty$, then the probability of $[a, b]$ can be evaluated by

\[ E_\lambda([a, b]) = e^{-\lambda a} - e^{-\lambda b} . \]

Moreover,

\[ E_\lambda([t, \infty)) = e^{-\lambda t} , \qquad t \ge 0 . \]


Remark 1.6.21. The exponential distribution plays an important role for the descrip- 
tion of lifetimes. For instance, it is used to determine the probability that the lifetime 
of a component part or the duration of a phone call exceeds a certain time T > 0. 
Furthermore, it is applied to describe the time between the arrivals of customers at a 
counter or in a shop. 


Example 1.6.22. Suppose that the duration of phone calls is exponentially distributed
with parameter $\lambda = 0.1$. What is the probability that a call lasts less than two time
units? Or what is the probability that it lasts between one and two units? Or more than
five units?

Answer: These probabilities are evaluated by

\[ E_{0.1}([0, 2]) = 1 - e^{-0.2} = 0.181269 , \qquad E_{0.1}([1, 2]) = e^{-0.1} - e^{-0.2} = 0.08611 \qquad \text{and} \]

\[ E_{0.1}([5, \infty)) = e^{-0.5} = 0.60653 . \]
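
The three probabilities of Example 1.6.22 follow from the closed formulas for the exponential distribution and are easy to reproduce. In the sketch below the function name `exp_prob` and the use of `math.inf` to encode an unbounded interval are illustrative assumptions.

```python
from math import exp, inf


def exp_prob(lam: float, a: float, b: float = inf) -> float:
    """Probability of [a, b] under the exponential distribution, 0 <= a < b <= oo."""
    upper = 0.0 if b == inf else exp(-lam * b)
    return exp(-lam * a) - upper


lam = 0.1
print(f"call shorter than two units : {exp_prob(lam, 0, 2):.6f}")
print(f"call between one and two    : {exp_prob(lam, 1, 2):.5f}")
print(f"call longer than five units : {exp_prob(lam, 5):.5f}")
```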


1.6.5 Erlang Distribution 


Another important class of gamma distributions is that of Erlang distributions defined 
as follows. 


Definition 1.6.23. For A > Oandn € N let Ey, := T',1,,. This probability measure 
is called Erlang distribution with parameters A and n. 


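Remark 1.6.24. The density $p_{\lambda,n}$ of the Erlang distribution is

$$p_{\lambda,n}(x) = \begin{cases} 0 & : x \le 0 \\ \dfrac{\lambda^n}{(n-1)!}\, x^{n-1} e^{-\lambda x} & : x > 0 . \end{cases}$$

Of course, $E_{\lambda,1} = E_\lambda$. Thus the Erlang distribution may be viewed as a generalized
exponential distribution.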


An important property of the Erlang distribution is as follows.

Proposition 1.6.25. If $t > 0$, then

$$E_{\lambda,n}([t,\infty)) = \sum_{j=0}^{n-1} \frac{(\lambda t)^j}{j!}\, e^{-\lambda t} .$$

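Proof: We have to show that for $t > 0$

$$\int_t^\infty p_{\lambda,n}(x)\,dx = \frac{\lambda^n}{(n-1)!} \int_t^\infty x^{n-1} e^{-\lambda x}\,dx = \sum_{j=0}^{n-1} \frac{(\lambda t)^j}{j!}\, e^{-\lambda t} . \tag{1.55}$$

This is done by induction over $n$.
If $n = 1$, then eq. (1.55) is valid by

$$\int_t^\infty p_{\lambda,1}(x)\,dx = \int_t^\infty \lambda e^{-\lambda x}\,dx = e^{-\lambda t} .$$

Suppose now eq. (1.55) is proven for some $n \ge 1$. Next, we have to show that it is also
valid for $n + 1$. Thus, we know

$$\frac{\lambda^n}{(n-1)!} \int_t^\infty x^{n-1} e^{-\lambda x}\,dx = \sum_{j=0}^{n-1} \frac{(\lambda t)^j}{j!}\, e^{-\lambda t} \tag{1.56}$$

and want

$$\frac{\lambda^{n+1}}{n!} \int_t^\infty x^{n} e^{-\lambda x}\,dx = \sum_{j=0}^{n} \frac{(\lambda t)^j}{j!}\, e^{-\lambda t} . \tag{1.57}$$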


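Let us integrate the integral in eq. (1.57) by parts as follows. Set $u := x^n$, hence
$u' = n x^{n-1}$, and $v' = e^{-\lambda x}$, thus $v = -\lambda^{-1} e^{-\lambda x}$. Doing so and using eq. (1.56),
the left-hand side of eq. (1.57) becomes

$$\frac{\lambda^{n+1}}{n!} \int_t^\infty x^n e^{-\lambda x}\,dx
 = \frac{\lambda^{n+1}}{n!} \Big[ -\lambda^{-1} x^n e^{-\lambda x} \Big]_t^\infty + \frac{\lambda^n}{(n-1)!} \int_t^\infty x^{n-1} e^{-\lambda x}\,dx$$

$$= \frac{(\lambda t)^n}{n!}\, e^{-\lambda t} + \sum_{j=0}^{n-1} \frac{(\lambda t)^j}{j!}\, e^{-\lambda t} = \sum_{j=0}^{n} \frac{(\lambda t)^j}{j!}\, e^{-\lambda t} .$$

This proves eq. (1.57) and, consequently, eq. (1.55) is valid for all $n \ge 1$. □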
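
As a plausibility check of Proposition 1.6.25, the closed-form tail probability can be compared with a numerical integral of the Erlang density. The following is a rough sketch only: the midpoint rule, the cut-off `upper` and all function names are choices made here for illustration, not part of the text.

```python
import math

def erlang_tail_closed(lam, n, t):
    """Right-hand side of Proposition 1.6.25: sum_{j<n} (lam*t)^j / j! * exp(-lam*t)."""
    return sum((lam * t) ** j / math.factorial(j) for j in range(n)) * math.exp(-lam * t)

def erlang_density(lam, n, x):
    """Erlang density lam^n / (n-1)! * x^(n-1) * exp(-lam*x) for x > 0, and 0 otherwise."""
    if x <= 0:
        return 0.0
    return lam ** n / math.factorial(n - 1) * x ** (n - 1) * math.exp(-lam * x)

def erlang_tail_numeric(lam, n, t, upper=60.0, steps=100_000):
    """Composite midpoint rule on [t, upper]; the mass beyond `upper` is neglected."""
    h = (upper - t) / steps
    return h * sum(erlang_density(lam, n, t + (i + 0.5) * h) for i in range(steps))

closed = erlang_tail_closed(1.5, 3, 2.0)
numeric = erlang_tail_numeric(1.5, 3, 2.0)
print(closed, numeric)
```

For $n = 1$ the closed form reduces to $e^{-\lambda t}$, the tail of the exponential distribution.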


1.6.6 Chi-Squared Distribution 


Another important class of gamma distributions is that of $\chi^2$-distributions. These
probability measures play a crucial role in Mathematical Statistics (cf. Chapter 8).

Definition 1.6.26. For $n \ge 1$ let

$$\chi^2_n := \Gamma_{2,\,n/2} .$$

This probability measure is called $\chi^2$-distribution with $n$ degrees of freedom.


Remark 1.6.27. At the moment the integer $n \ge 1$ in Definition 1.6.26 is only a
parameter. The term “degrees of freedom” will become clear when we apply the
$\chi^2$-distribution to statistical problems.


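Remark 1.6.28. The density $p$ of a $\chi^2_n$-distribution is given by

$$p(x) = \begin{cases} 0 & : x \le 0 \\ \dfrac{x^{n/2-1} e^{-x/2}}{2^{n/2}\,\Gamma(n/2)} & : x > 0 , \end{cases}$$

i.e., if $0 \le a < b$, then

$$\chi^2_n([a,b]) = \frac{1}{2^{n/2}\,\Gamma(n/2)} \int_a^b x^{n/2-1} e^{-x/2}\,dx .$$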


1.6.7 Beta Distribution 


Tightly connected with the gamma function is Euler’s beta function $B$. It maps
$(0,\infty) \times (0,\infty)$ to $\mathbb{R}$ and is defined by

$$B(x,y) := \int_0^1 s^{x-1} (1-s)^{y-1}\,ds, \quad x, y > 0 . \tag{1.58}$$

The link between gamma and beta function is the following important identity:

$$B(x,y) = \frac{\Gamma(x)\cdot\Gamma(y)}{\Gamma(x+y)}, \quad x, y > 0 . \tag{1.59}$$

For a proof of eq. (1.59) we refer to Problem 1.27.

Further properties of the beta function are either easy to prove or follow via
eq. (1.59) from those of the gamma function.
1. The beta function is continuous on $(0,\infty) \times (0,\infty)$ with values in $(0,\infty)$.
2. For $x, y > 0$ one has $B(x,y) = B(y,x)$.
3. If $x, y > 0$, then

$$B(x+1, y) = \frac{x}{x+y}\, B(x,y) . \tag{1.60}$$

4. For $x > 0$ it follows that $B(x,1) = 1/x$.
5. If $n, m \ge 1$ are integers, then

$$B(n,m) = \frac{(n-1)!\,(m-1)!}{(n+m-1)!} .$$
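
The listed properties can be spot-checked numerically. The sketch below evaluates $B$ through identity (1.59) using `math.gamma` and compares it with a crude midpoint approximation of the integral (1.58); the helper names and the sample arguments are assumptions made for illustration.

```python
import math

def beta(x, y):
    """Euler's beta function via identity (1.59): Gamma(x) * Gamma(y) / Gamma(x + y)."""
    return math.gamma(x) * math.gamma(y) / math.gamma(x + y)

def beta_numeric(x, y, steps=100_000):
    """Midpoint rule for the integral (1.58); adequate for x, y >= 1."""
    h = 1.0 / steps
    return h * sum(((i + 0.5) * h) ** (x - 1) * (1 - (i + 0.5) * h) ** (y - 1)
                   for i in range(steps))

x, y = 2.5, 3.0
print(beta(x, y), beta_numeric(x, y))                  # integral vs. gamma identity
print(beta(x, y), beta(y, x))                          # property 2: symmetry
print(beta(x + 1, y), x / (x + y) * beta(x, y))        # property 3: eq. (1.60)
print(beta(4, 3), math.factorial(3) * math.factorial(2) / math.factorial(6))  # property 5
```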


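Definition 1.6.29. Let $\alpha, \beta > 0$. The probability measure $B_{\alpha,\beta}$ defined by

$$B_{\alpha,\beta}([a,b]) := \frac{1}{B(\alpha,\beta)} \int_a^b x^{\alpha-1} (1-x)^{\beta-1}\,dx, \quad 0 \le a < b \le 1,$$

is called beta distribution with parameters $\alpha$ and $\beta$.


The density function $q_{\alpha,\beta}$ of $B_{\alpha,\beta}$ is given by

$$q_{\alpha,\beta}(x) = \begin{cases} \dfrac{1}{B(\alpha,\beta)}\, x^{\alpha-1} (1-x)^{\beta-1} & : 0 < x < 1 \\ 0 & : \text{otherwise} \end{cases}$$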


Figure 1.5: Density functions of the beta distribution with
parameters (0.5, 1.5), (1.5, 2.5), (2.5, 2), (1.5, 2), (2, 1.5), and (2.5, 2.5).


Remark 1.6.30. It is easy to see that $q_{\alpha,\beta}$ is a density function:

$$\int_{-\infty}^{\infty} q_{\alpha,\beta}(x)\,dx = \frac{1}{B(\alpha,\beta)} \int_0^1 x^{\alpha-1} (1-x)^{\beta-1}\,dx = \frac{B(\alpha,\beta)}{B(\alpha,\beta)} = 1 .$$

Furthermore, since $q_{\alpha,\beta}(x) = 0$ if $x \notin [0,1]$, the probability measure $B_{\alpha,\beta}$ is concentrated
on $[0,1]$, that is, $B_{\alpha,\beta}([0,1]) = 1$ or, equivalently, $B_{\alpha,\beta}(\mathbb{R} \setminus [0,1]) = 0$.


Example 1.6.31. Choose independently $n$ numbers $x_1, \dots, x_n$ in $[0,1]$ according to the
uniform distribution. Ordering these numbers by their size we get $x_1^* \le \cdots \le x_n^*$. In
Example 3.7.8 we will show that the $k$th smallest number $x_k^*$ is $B_{k,n-k+1}$-distributed. In
other words, if $0 \le a < b \le 1$, then

$$P\{a \le x_k^* \le b\} = B_{k,n-k+1}([a,b]) = \frac{n!}{(k-1)!\,(n-k)!} \int_a^b x^{k-1} (1-x)^{n-k}\,dx .$$


1.6.8 Cauchy Distribution 
We start with the following statement. 


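Proposition 1.6.32. The function $p$ defined by

$$p(x) = \frac{1}{\pi} \cdot \frac{1}{1+x^2}, \quad x \in \mathbb{R}, \tag{1.61}$$

is a probability density.


Proof: Of course, $p(x) > 0$ for $x \in \mathbb{R}$. Let us now investigate $\int_{-\infty}^{\infty} p(x)\,dx$. Because of
$\lim_{b\to\infty} \arctan(b) = \pi/2$ and $\lim_{a\to-\infty} \arctan(a) = -\pi/2$ it follows that

$$\int_{-\infty}^{\infty} p(x)\,dx = \frac{1}{\pi} \lim_{a\to-\infty} \lim_{b\to\infty} \int_a^b \frac{1}{1+x^2}\,dx = \frac{1}{\pi} \lim_{a\to-\infty} \lim_{b\to\infty} \big[ \arctan x \big]_a^b = 1 .$$

Thus, as asserted, $p$ is a probability density. □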


Figure 1.6: The density function of the Cauchy distribution.


Definition 1.6.33. The probability measure $P$ with density $p$ from eq. (1.61) is
called Cauchy distribution. In other words, the Cauchy distribution $P$ is
characterized by

$$P([a,b]) = \frac{1}{\pi} \int_a^b \frac{1}{1+x^2}\,dx = \frac{1}{\pi} \big[ \arctan(b) - \arctan(a) \big] .$$
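
The formula in Definition 1.6.33 translates directly into code. A minimal sketch (the function name `cauchy_interval` is illustrative):

```python
import math

def cauchy_interval(a, b):
    """P([a, b]) = (arctan(b) - arctan(a)) / pi for the Cauchy distribution."""
    return (math.atan(b) - math.atan(a)) / math.pi

print(cauchy_interval(-1.0, 1.0))      # half of the mass lies in [-1, 1]
print(cauchy_interval(-1e12, 1e12))    # approaches 1 as the interval grows
```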


1.7 Distribution Function 


In this section we always assume that the sample space is $\mathbb{R}$, even if the random
experiment has only finitely or countably infinitely many different outcomes. For example,
rolling a die once is modeled by $(\mathbb{R}, \mathcal{P}(\mathbb{R}), P)$, where $P(\{k\}) = 1/6$, $k = 1, \dots, 6$, and
$P(\{x\}) = 0$ whenever $x \notin \{1, \dots, 6\}$.

Thus, let $P$ be a probability measure defined either on $\mathcal{B}(\mathbb{R})$ (continuous case) or
on $\mathcal{P}(\mathbb{R})$ (discrete case).


Definition 1.7.1. The function $F : \mathbb{R} \to [0,1]$ defined by

$$F(t) := P((-\infty, t]), \quad t \in \mathbb{R},$$

is called¹⁵ the (cumulative) distribution function of $P$.


Remark 1.7.2. If $P$ is discrete, that is, $P(D) = 1$ for some $D = \{x_1, x_2, \dots\}$, then its
distribution function can be evaluated by

$$F(t) = \sum_{x_j \le t} P(\{x_j\}) = \sum_{x_j \le t} p_j ,$$

where $p_j = P(\{x_j\})$, while for continuous $P$ with probability density $p$

$$F(t) = \int_{-\infty}^{t} p(x)\,dx .$$


Example 1.7.3. Let $P$ be the uniform distribution on $\{1, \dots, 6\}$. Then

$$F(t) = \begin{cases} 0 & : t < 1 \\ \dfrac{k}{6} & : k \le t < k+1, \ k \in \{1, \dots, 5\} \\ 1 & : t \ge 6 . \end{cases}$$
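
For a discrete measure with finitely many atoms, the sum in Remark 1.7.2 is easy to evaluate exactly. The sketch below stores the point masses in a dictionary (a representation chosen here for illustration) and reproduces the values of Example 1.7.3.

```python
from fractions import Fraction

def discrete_cdf(point_masses, t):
    """F(t) = sum of P({x_j}) over all atoms x_j <= t."""
    return sum((p for x, p in point_masses.items() if x <= t), Fraction(0))

# Uniform distribution on {1, ..., 6}: every face carries mass 1/6.
die = {k: Fraction(1, 6) for k in range(1, 7)}

for t in (0.5, 1, 3.7, 6, 10):
    print(t, discrete_cdf(die, t))
```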


15 To shorten the notation, mostly we will call it “distribution function” instead of, as often used in
the literature, “cumulative distribution function” or, abbreviated, CDF.

Example 1.7.4. The distribution function of the binomial distribution $B_{n,p}$ is given by

$$F(t) = \sum_{0 \le k \le t} \binom{n}{k} p^k (1-p)^{n-k}, \quad 0 \le t < \infty ,$$

and $F(t) = 0$ if $t < 0$.


Figure 1.7: Distribution function of the binomial distribution $B_{25,0.4}$.


Example 1.7.5. The distribution function of the exponential distribution $E_\lambda$ equals

$$F(t) = \begin{cases} 0 & : t \le 0 \\ 1 - e^{-\lambda t} & : t > 0 . \end{cases}$$


Figure 1.8: Distribution function of $E_{0.5}$.


Example 1.7.6. The distribution function of the standard normal distribution is
denoted¹⁶ by $\Phi$, therefore also called Gaussian $\Phi$-function.

$$\Phi(t) := \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{t} e^{-x^2/2}\,dx, \quad t \in \mathbb{R} . \tag{1.62}$$


16 Sometimes also denoted as “norm(·).”


Figure 1.9: Distribution function of the standard normal distribution ($\Phi$-function).


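Remark 1.7.7. The Gaussian $\Phi$-function is tightly related to the Gaussian error
function defined by

$$\operatorname{erf}(t) := \frac{2}{\sqrt{\pi}} \int_0^t e^{-x^2}\,dx, \quad t \in \mathbb{R} .$$

Observe that $\operatorname{erf}(-t) = -\operatorname{erf}(t)$. The link between $\Phi$ and the error function is

$$\Phi(t) = \frac{1}{2} \left[ 1 + \operatorname{erf}\!\left( \frac{t}{\sqrt{2}} \right) \right] \quad\text{and}\quad \operatorname{erf}(t) = 2\,\Phi(\sqrt{2}\,t) - 1, \quad t \in \mathbb{R} . \tag{1.63}$$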
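
Eq. (1.63) is the usual way to evaluate $\Phi$ in software, since many standard libraries provide the error function. A short sketch using Python's `math.erf`; the names `phi` and `erf_from_phi` are illustrative.

```python
import math

def phi(t):
    """Gaussian Phi-function via eq. (1.63): Phi(t) = (1 + erf(t / sqrt(2))) / 2."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

def erf_from_phi(t):
    """Inverse direction of eq. (1.63): erf(t) = 2 * Phi(sqrt(2) * t) - 1."""
    return 2.0 * phi(math.sqrt(2.0) * t) - 1.0

print(phi(0.0), phi(1.96))
print(erf_from_phi(0.7), math.erf(0.7))
```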


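Example 1.7.8. Let $P$ be the uniform distribution on the interval $[\alpha, \beta]$. Then its
distribution function is

$$F(t) = \begin{cases} 0 & : t < \alpha \\ \dfrac{t-\alpha}{\beta-\alpha} & : \alpha \le t \le \beta \\ 1 & : t > \beta . \end{cases}$$

In particular, for the uniform distribution on $[0,1]$ one obtains

$$F(t) = \begin{cases} 0 & : t < 0 \\ t & : 0 \le t \le 1 \\ 1 & : t > 1 . \end{cases}$$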


The next proposition lists the main properties of distribution functions.

Proposition 1.7.9. Let $F$ be the distribution function of a probability measure $P$ on $\mathbb{R}$,
discrete or continuous. Then $F$ possesses the following properties.

(1) $F$ is nondecreasing.

(2) $F(-\infty) = \lim_{t\to-\infty} F(t) = 0$ and $F(\infty) = \lim_{t\to\infty} F(t) = 1$.

(3) $F$ is continuous from the right.


Proof: Suppose $s < t$. This implies $(-\infty, s] \subseteq (-\infty, t]$, hence, since $P$ is monotone, we
obtain

$$F(s) = P((-\infty, s]) \le P((-\infty, t]) = F(t) .$$

Thus $F$ is nondecreasing.
Take any sequence $(t_n)_{n\ge1}$ that decreases monotonically to $-\infty$. Set $A_n := (-\infty, t_n]$.
Then $A_1 \supseteq A_2 \supseteq \cdots$ as well as $\bigcap_{n=1}^{\infty} A_n = \emptyset$. Since $P$ is continuous from above, it follows that

$$\lim_{n\to\infty} F(t_n) = \lim_{n\to\infty} P(A_n) = P(\emptyset) = 0 .$$

This being true for any sequence $(t_n)_{n\ge1}$ tending to $-\infty$ implies $F(-\infty) = 0$.

The proof for $F(\infty) = 1$ is very similar. Now $(t_n)_{n\ge1}$ increases monotonically to $\infty$. If
as before $A_n := (-\infty, t_n]$, this time $A_1 \subseteq A_2 \subseteq \cdots$ and $\bigcup_{n=1}^{\infty} A_n = \mathbb{R}$. By the continuity
of $P$ from below we now obtain

$$\lim_{n\to\infty} F(t_n) = \lim_{n\to\infty} P(A_n) = P(\mathbb{R}) = 1 .$$

Again, since the $t_n$ were arbitrary, $F(\infty) = 1$.

Thus it remains to prove that $F$ is continuous from the right. To do this, we take
$t \in \mathbb{R}$ and a decreasing sequence $(t_n)_{n\ge1}$ tending to $t$. We have to show that if $n \to \infty$,
then $F(t_n) \to F(t)$.

As before set $A_n := (-\infty, t_n]$. Again $A_1 \supseteq A_2 \supseteq \cdots$, but now $\bigcap_{n=1}^{\infty} A_n = (-\infty, t]$.
Another application of the continuity from above implies

$$F(t) = P((-\infty, t]) = \lim_{n\to\infty} P(A_n) = \lim_{n\to\infty} F(t_n) .$$

This is valid for each $t \in \mathbb{R}$, hence $F$ is continuous from the right. □


Properties (1), (2), and (3) in Proposition 1.7.9 characterize distribution functions. More
precisely, the following result is true. Its proof is based on an extension theorem in
Measure Theory. Therefore, we can show here only its main ideas.


Proposition 1.7.10. Let $F : \mathbb{R} \to \mathbb{R}$ be an arbitrary function possessing the properties
stated in Proposition 1.7.9. Then there exists a unique probability measure $P$ on $\mathcal{B}(\mathbb{R})$
such that

$$F(t) = P((-\infty, t]), \quad t \in \mathbb{R} .$$

Idea of the proof: If $a < b$ set

$$P_0((a,b]) := F(b) - F(a) .$$

In this way we get a mapping $P_0$ defined on the collection of all half-open intervals
$\{(a,b] : a < b\}$. The key point is to verify that $P_0$ can be uniquely extended to a
probability measure $P$ on $\mathcal{B}(\mathbb{R})$. One way to do this is to introduce a so-called outer measure
$P^*$ defined on $\mathcal{P}(\mathbb{R})$ by

$$P^*(B) := \inf\Big\{ \sum_{i=1}^{\infty} P_0((a_i, b_i]) : B \subseteq \bigcup_{i=1}^{\infty} (a_i, b_i] \Big\} .$$

Generally, this outer measure is not $\sigma$-additive. Therefore, one restricts $P^*$ to $\mathcal{B}(\mathbb{R})$. If
$P$ denotes this restriction, the most difficult part of the proof is to verify that $P$ is
$\sigma$-additive. After this has been done, by the construction, $P$ is the probability measure
possessing distribution function $F$.

The uniqueness of $P$ follows by a general uniqueness theorem for probability
measures asserting the following.

Let $P_1$ and $P_2$ be two probability measures on $(\Omega, \mathcal{A})$ and let $\mathcal{E} \subseteq \mathcal{A}$ be a collection
of events closed under taking intersections and generating $\mathcal{A}$. If $P_1(E) = P_2(E)$ for all
$E \in \mathcal{E}$, then $P_1 = P_2$. In our case $\mathcal{E} = \{(-\infty, t] : t \in \mathbb{R}\}$.


Conclusion: If the outcomes of a random experiment are real numbers, then this
experiment can also be described by a function $F : \mathbb{R} \to \mathbb{R}$ possessing the properties
in Proposition 1.7.9. Then $F(t)$ is the probability of observing a result that is less than or
equal to $t$.


Let us state further properties of distribution functions. 


Proposition 1.7.11. If $F$ is the distribution function of a probability measure $P$, then for
all $a < b$

$$F(b) - F(a) = P((a,b]) .$$


Proof: Observing that $(-\infty, a] \subseteq (-\infty, b]$, this is an immediate consequence of

$$F(b) - F(a) = P((-\infty, b]) - P((-\infty, a]) = P\big((-\infty, b] \setminus (-\infty, a]\big) = P((a,b]) . \qquad □$$

Since $F$ is nondecreasing and bounded, for each $t \in \mathbb{R}$ the left-hand limit

$$F(t-0) := \lim_{s \to t,\; s < t} F(s)$$

exists and, moreover, $F(t-0) \le F(t)$. Furthermore, by the right continuity of $F$ one has
$F(t-0) = F(t)$ if and only if $F$ is continuous at the point $t$.
If this is not so, then $h = F(t) - F(t-0) > 0$, that is, $F$ possesses at $t$ a jump of
height $h > 0$. This height is directly connected with the value of $P(\{t\})$.


Proposition 1.7.12. The distribution function $F$ of a probability measure $P$ has a jump
of height $h > 0$ at $t \in \mathbb{R}$ if and only if $P(\{t\}) = h$.


Proof: Let $(t_n)_{n\ge1}$ be a sequence of real numbers increasing monotonically to $t$. Then,
using that $P$ is continuous from above, it follows that

$$h = F(t) - F(t-0) = \lim_{n\to\infty} [F(t) - F(t_n)] = \lim_{n\to\infty} P((t_n, t]) = P(\{t\}) .$$

Observe that $\bigcap_{n=1}^{\infty} (t_n, t] = \{t\}$. This proves the assertion. □

Corollary 1.7.13. The function $F$ is continuous at $t \in \mathbb{R}$ if and only if $P(\{t\}) = 0$.


Example 1.7.14. Suppose the function $F$ is defined by

$$F(t) = \begin{cases} 0 & : t < -1 \\ 1/3 & : -1 \le t < 0 \\ 1/2 & : 0 \le t < 1 \\ 2/3 & : 1 \le t < 2 \\ 1 & : t \ge 2 . \end{cases}$$

Then $F$ fulfils the assumptions of Proposition 1.7.10. Hence there is a probability
measure $P$ with $F(t) = P((-\infty, t])$. What does $P$ look like?

Answer: The function $F$ has jumps at $-1$, $0$, $1$, and $2$ with heights $1/3$, $1/6$, $1/6$, and
$1/3$. Therefore,

$$P(\{-1\}) = 1/3, \quad P(\{0\}) = 1/6, \quad P(\{1\}) = 1/6 \quad\text{and}\quad P(\{2\}) = 1/3 ,$$

hence $P$ is the discrete probability measure concentrated on $D = \{-1, 0, 1, 2\}$ with $P(\{t\})$,
$t \in D$, given above.
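
The jump heights in Example 1.7.14 can be read off mechanically from $F$. The sketch below assumes a step function whose steps are wider than a small offset `eps`, so that $F(t - \varepsilon)$ equals the left-hand limit $F(t-0)$; this shortcut and the helper names are illustrative and not a general method.

```python
from fractions import Fraction

def F(t):
    """Distribution function from Example 1.7.14."""
    if t < -1:
        return Fraction(0)
    if t < 0:
        return Fraction(1, 3)
    if t < 1:
        return Fraction(1, 2)
    if t < 2:
        return Fraction(2, 3)
    return Fraction(1)

def jump_height(cdf, t, eps=Fraction(1, 1000)):
    """F(t) - F(t - 0) for a step function whose steps are wider than eps."""
    return cdf(t) - cdf(t - eps)

# By Proposition 1.7.12 the jump heights are the point masses P({t}).
atoms = {t: jump_height(F, t) for t in (-1, 0, 1, 2)}
print(atoms)
```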


Suppose now that $P$ is continuous with density function $p$. Recall that then

$$F(t) = P((-\infty, t]) = \int_{-\infty}^{t} p(x)\,dx, \quad t \in \mathbb{R} . \tag{1.64}$$

In particular, since $F$ is the function of the upper bound in an integral, it is continuous.
Next we investigate the question whether we may evaluate the density $p$
knowing $F$.


Proposition 1.7.15. Suppose $p$ is continuous at some $t \in \mathbb{R}$. Then $F$ is differentiable at
$t$ with

$$F'(t) = \frac{d}{dt} F(t) = p(t) .$$

Proof: This follows immediately by an application of the fundamental theorem of
Calculus to representation (1.64) of $F$. □


Remark 1.7.16. Let $F$ be the distribution function of a probability measure $P$. If $F$ is
continuous, then $P(\{t\}) = 0$ for all $t \in \mathbb{R}$. But does this also imply that $P$ is continuous,
that is, that $P$ has a density? The answer is negative. There exist probability measures $P$
on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$ with a continuous distribution function but without a density.
Such probability measures are called singularly continuous.

To get an impression of how such probability measures look, let us briefly sketch
the construction of an example. Let $C$ be the Cantor set introduced in Example 1.6.5.
The basic idea is to transfer the uniform distribution on $[0,1]$ to a probability measure
$P$ with $P(C) = 1$. The transformation is done by the function $f$ defined as follows. If
$x \in [0,1]$ is represented as $x = \sum_{k=1}^{\infty} \frac{x_k}{2^k}$ with $x_k \in \{0,1\}$, then $f(x) = \sum_{k=1}^{\infty} \frac{2 x_k}{3^k}$. Note that $f$
maps $[0,1]$ into $C$. If $P_0$ denotes the uniform distribution on $[0,1]$, define the probability
measure $P$ by

$$P(B) := P_0\{x \in [0,1] : f(x) \in B\} .$$

Then for all $t \in \mathbb{R}$ we have $P(\{t\}) = 0$, but since $P(C) = 1$, $P$ cannot have a density.
Indeed, such a density should vanish outside $C$. But, as we saw, the probability of $C$
with respect to the uniform distribution is zero. Hence the only possible density would
be $p(t) = 0$, $t \in \mathbb{R}$. This contradiction shows that $P$ is not continuous in our sense.


Assuming a little bit more than the continuity of F, the corresponding probability 
measure possesses a density (cf. [Coh13]). 


Proposition 1.7.17. Let $F$ be the distribution function of a probability measure $P$. If $F$ is
continuous and continuously differentiable with the exception of at most finitely many
points, then $P$ is continuous. That is, there is a density function $p$ such that

$$F(t) = P((-\infty, t]) = \int_{-\infty}^{t} p(x)\,dx, \quad t \in \mathbb{R} .$$


Remark 1.7.18. Proposition 1.7.15 implies p(t) = F’(t) for those t where F’(t) exists. If F 
is not differentiable at t, define p(t) somehow, for example, p(t) = 0. 


Example 1.7.19. For some $\alpha, \beta > 0$ define $F$ by

$$F(t) = \begin{cases} 0 & : t \le 0 \\ 1 - e^{-\alpha t^{\beta}} & : t > 0 . \end{cases}$$

It is easy to see that this function satisfies the properties of Proposition 1.7.9. Moreover,
it is continuous and continuously differentiable on $\mathbb{R} \setminus \{0\}$. By Proposition 1.7.17 the
corresponding probability measure $P$ is continuous and since

$$F'(t) = \begin{cases} 0 & : t < 0 \\ \alpha\beta\, t^{\beta-1} e^{-\alpha t^{\beta}} & : t > 0 , \end{cases}$$

a suitable density function is $p(t) = F'(t)$, $t \ne 0$, and $p(0) = 0$.
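
The relation $p = F'$ from Example 1.7.19 can be checked numerically with a central difference quotient. This is a rough sketch: the step size `h`, the sample parameters and the function names are choices made here for illustration.

```python
import math

def F(t, alpha, beta):
    """Distribution function of Example 1.7.19."""
    return 0.0 if t <= 0 else 1.0 - math.exp(-alpha * t ** beta)

def p(t, alpha, beta):
    """Candidate density alpha * beta * t^(beta - 1) * exp(-alpha * t^beta) for t > 0."""
    return 0.0 if t <= 0 else alpha * beta * t ** (beta - 1) * math.exp(-alpha * t ** beta)

def central_difference(f, t, h=1e-6):
    """Symmetric difference quotient approximating f'(t)."""
    return (f(t + h) - f(t - h)) / (2 * h)

alpha, beta = 2.0, 1.5
for t in (0.3, 1.0, 2.0):
    print(t, central_difference(lambda s: F(s, alpha, beta), t), p(t, alpha, beta))
```

For $\beta = 1$ the density reduces to $\alpha e^{-\alpha t}$, that is, to the exponential distribution $E_\alpha$.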


1.8 Multivariate Continuous Distributions 
1.8.1 Multivariate Density Functions 


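In this section we suppose that $\Omega = \mathbb{R}^n$. A subset $Q \subset \mathbb{R}^n$ is called a (closed,
$n$-dimensional) box¹⁷ provided that for some real numbers $a_i < b_i$, $1 \le i \le n$,

$$Q = \{(x_1, \dots, x_n) \in \mathbb{R}^n : a_i \le x_i \le b_i, \ 1 \le i \le n\} . \tag{1.65}$$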


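Definition 1.8.1. A Riemann integrable function $p : \mathbb{R}^n \to \mathbb{R}$ is said to be an
$n$-dimensional probability density function or shorter $n$-dimensional density
function if $p(x) \ge 0$ for $x \in \mathbb{R}^n$ and, furthermore,

$$\int_{\mathbb{R}^n} p(x)\,dx = \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} p(x_1, \dots, x_n)\,dx_n \cdots dx_1 = 1 .$$


Suppose a box $Q$ is represented with certain $a_i < b_i$ as in eq. (1.65). Then we set

$$P(Q) = \int_Q p(x)\,dx = \int_{a_1}^{b_1} \cdots \int_{a_n}^{b_n} p(x_1, \dots, x_n)\,dx_n \cdots dx_1 . \tag{1.66}$$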


In analogy to Definition 1.1.15 we introduce now the Borel $\sigma$-field $\mathcal{B}(\mathbb{R}^n)$.


Definition 1.8.2. Let $\mathcal{C}$ be the collection of all boxes in $\mathbb{R}^n$. Then $\mathcal{B}(\mathbb{R}^n) := \sigma(\mathcal{C})$
denotes the Borel $\sigma$-field¹⁸. In other words, $\mathcal{B}(\mathbb{R}^n)$ is the smallest $\sigma$-field containing
all (closed) boxes in $\mathbb{R}^n$. Sets in $\mathcal{B}(\mathbb{R}^n)$ are called ($n$-dimensional) Borel sets.


17 Also called “hyper-rectangle.”
18 Recall that the existence of $\sigma(\mathcal{C})$ was proven in Proposition 1.1.12.

Remark 1.8.3. As in the univariate case there exist several other collections of subsets
in $\mathbb{R}^n$ generating $\mathcal{B}(\mathbb{R}^n)$. For example, one may choose the collection of open boxes or
the sets that may be written as

$$(-\infty, t_1] \times \cdots \times (-\infty, t_n], \quad t_1, \dots, t_n \in \mathbb{R} .$$


With the previous notations, the following multivariate extension theorem is valid. 
Compare Proposition 1.5.6 for the univariate case. 


Proposition 1.8.4. Let $P$ be defined on boxes by eq. (1.66). Then $P$ admits a unique
extension to a probability measure $P$ on $\mathcal{B}(\mathbb{R}^n)$.


Definition 1.8.5. A probability measure $P$ on $\mathcal{B}(\mathbb{R}^n)$ is called continuous provided
that there exists a probability density $p : \mathbb{R}^n \to \mathbb{R}$ such that $P(Q) = \int_Q p(x)\,dx$ for
all boxes $Q \subseteq \mathbb{R}^n$. The function $p$ is said to be the density function or simply
density of $P$.


Remark 1.8.6. It is easy to see that the validity of eq. (1.66) for all boxes is equivalent
to the following. If $t_1, \dots, t_n \in \mathbb{R}$ and $B_{t_1,\dots,t_n} := (-\infty, t_1] \times \cdots \times (-\infty, t_n]$, then

$$P(B_{t_1,\dots,t_n}) = \int_{B_{t_1,\dots,t_n}} p(x)\,dx = \int_{-\infty}^{t_1} \cdots \int_{-\infty}^{t_n} p(x_1, \dots, x_n)\,dx_n \cdots dx_1 . \tag{1.67}$$

Thus $P$ is continuous if and only if eq. (1.67) is satisfied for all $t_1, \dots, t_n \in \mathbb{R}$.
Let us first give an example of a multivariate probability density function.


Example 1.8.7. Regard $p : \mathbb{R}^3 \to \mathbb{R}$ defined by

$$p(x_1, x_2, x_3) = \begin{cases} 48\, x_1 x_2 x_3 & : 0 \le x_1 \le x_2 \le x_3 \le 1 \\ 0 & : \text{otherwise} . \end{cases}$$

Of course, $p(x) \ge 0$ for $x \in \mathbb{R}^3$. Moreover,

$$\int_{\mathbb{R}^3} p(x)\,dx = 48 \int_0^1 \int_0^{x_3} \int_0^{x_2} x_1 x_2 x_3\,dx_1\,dx_2\,dx_3
 = 48 \int_0^1 \int_0^{x_3} \frac{x_2^3 x_3}{2}\,dx_2\,dx_3 = 48 \int_0^1 \frac{x_3^5}{8}\,dx_3 = 1 ,$$

hence it is a density function on $\mathbb{R}^3$. For example, if $P$ is the generated probability
measure, then

$$P([0, 1/2]^3) = 48 \int_0^{1/2} \int_0^{x_3} \int_0^{x_2} x_1 x_2 x_3\,dx_1\,dx_2\,dx_3 = \frac{1}{2^6} = \frac{1}{64} .$$
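
The iterated integrals in Example 1.8.7 only involve monomials, so they can be carried out exactly with rational arithmetic. The sketch below tracks the integrand as `coef * x**power` in the current outer variable; this bookkeeping and the name `mass_on_cube` are illustrative and specific to this density.

```python
from fractions import Fraction

def mass_on_cube(c):
    """P([0, c]^3) for the density 48*x1*x2*x3 on 0 <= x1 <= x2 <= x3 <= 1, where 0 <= c <= 1."""
    coef, power = Fraction(48), 0          # start with the constant factor 48
    for _ in range(3):                     # integrate over x1, then x2, then x3
        power += 1                         # multiply by the current variable
        coef, power = coef / (power + 1), power + 1   # integrate from 0 to the next bound
    return coef * Fraction(c) ** power     # the outermost upper bound is c

print(mass_on_cube(1))                 # total mass of the density
print(mass_on_cube(Fraction(1, 2)))    # P([0, 1/2]^3)
```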


1.8.2 Multivariate Uniform Distribution 


Our next aim is the introduction and the investigation of a special multivariate
distribution, the uniform distribution on a set $K$ in $\mathbb{R}^n$. To do so, we recall how we defined
the uniform distribution on an interval $I$ in $\mathbb{R}$. Its density $p$ is given by

$$p(s) = \begin{cases} \dfrac{1}{|I|} & : s \in I \\ 0 & : s \notin I . \end{cases}$$

Here $|I|$ denotes the length of the interval $I$. Let now $K \subset \mathbb{R}^n$ be bounded. In order to
introduce a similar density for the uniform distribution on $K$, the length of the underlying
set has to be replaced by the $n$-dimensional volume, which we will denote by
$\mathrm{vol}_n(K)$. But how is this volume defined?

To answer this question let us first investigate a box $Q$ represented as in eq. (1.65).
It is immediately clear that its $n$-dimensional volume is evaluated by

$$\mathrm{vol}_n(Q) = \prod_{i=1}^{n} (b_i - a_i) .$$

If $n = 1$, then $Q$ is an interval and its one-dimensional volume is nothing else than its
length. For $n = 2$ the box $Q$ is the rectangle $[a_1, b_1] \times [a_2, b_2]$ and

$$\mathrm{vol}_2(Q) = (b_1 - a_1)(b_2 - a_2)$$

coincides with the area of $Q$. If $n = 3$, then $\mathrm{vol}_3(Q)$ is the ordinary volume of bodies
in $\mathbb{R}^3$.

For arbitrary $K \subset \mathbb{R}^n$ the definition of its volume $\mathrm{vol}_n(K)$ is more involved. Let us
briefly sketch one way how this can be done. Setting

$$\mathrm{vol}_n(K) := \inf\Big\{ \sum_{j=1}^{\infty} \mathrm{vol}_n(Q_j) : K \subseteq \bigcup_{j=1}^{\infty} Q_j, \ Q_j \text{ box} \Big\} , \tag{1.68}$$

at least for Borel sets $K \subseteq \mathbb{R}^n$ a suitable volume is defined. In the case of “ordinary”
sets such as balls, ellipsoids, or similar bodies this approach leads to the known values.
The background is the basic formula

$$\mathrm{vol}_n(K) = \int \cdots \int_K 1\,dx_n \cdots dx_1 \tag{1.69}$$

valid for Borel sets $K \subseteq \mathbb{R}^n$. For example, if $K$ is the square in $\mathbb{R}^2$ with corner points $(1, 0)$,
$(0, 1)$, $(-1, 0)$, and $(0, -1)$, then

$$\mathrm{vol}_2(K) = \iint_K 1\,dx_2\,dx_1 = \int_{-1}^{0} \int_{-x_1-1}^{1+x_1} dx_2\,dx_1 + \int_0^1 \int_{x_1-1}^{1-x_1} dx_2\,dx_1$$

$$= 2 \int_{-1}^{0} (x_1 + 1)\,dx_1 + 2 \int_0^1 (1 - x_1)\,dx_1 = 2 \Big[ \frac{x_1^2}{2} + x_1 \Big]_{-1}^{0} + 2 \Big[ x_1 - \frac{x_1^2}{2} \Big]_0^1 = 2 .$$


-1 
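Example 1.8.8. Let $K_n(r)$ be the $n$-dimensional ball of radius $r > 0$, that is,

$$K_n(r) = \{x \in \mathbb{R}^n : |x| \le r\} = \{(x_1, \dots, x_n) \in \mathbb{R}^n : x_1^2 + \cdots + x_n^2 \le r^2\} .$$

If

$$V_n(r) := \mathrm{vol}_n(K_n(r)), \quad r > 0,$$

denotes the $n$-dimensional volume of this ball, an easy change of variables implies
$V_n(r) = V_n \cdot r^n$, where $V_n = V_n(1)$. But for $K_n = K_n(1)$ eq. (1.69) gives

$$V_n = \int \cdots \int_{K_n} 1\,dx_n \cdots dx_1 = \int_{-1}^{1} \Big[ \int \cdots \int_{\{x_2^2 + \cdots + x_n^2 \le 1 - x_1^2\}} 1\,dx_n \cdots dx_2 \Big]\,dx_1$$

$$= \int_{-1}^{1} V_{n-1}\Big( \sqrt{1 - x_1^2} \Big)\,dx_1 = \int_{-1}^{1} V_{n-1}\Big( \sqrt{1 - s^2} \Big)\,ds .$$

Hence, by $V_{n-1}(r) = r^{n-1}\, V_{n-1}(1) = r^{n-1}\, V_{n-1}$ we obtain

$$V_n = V_{n-1} \cdot \int_{-1}^{1} (1 - s^2)^{(n-1)/2}\,ds = 2\, V_{n-1} \cdot \int_0^1 (1 - s^2)^{(n-1)/2}\,ds .$$

The change of variables $s = y^{1/2}$, thus $ds = \frac{1}{2}\, y^{-1/2}\,dy$, yields

$$V_n = V_{n-1} \cdot \int_0^1 y^{-1/2} (1 - y)^{(n-1)/2}\,dy = V_{n-1}\, B\Big( \frac{1}{2}, \frac{n+1}{2} \Big) = \sqrt{\pi}\; V_{n-1}\, \frac{\Gamma\big( \frac{n+1}{2} \big)}{\Gamma\big( \frac{n}{2} + 1 \big)} .$$

Hereby we used eq. (1.59) as well as $\Gamma(1/2) = \sqrt{\pi}$. Starting with $V_1 = 2$, a recursive
application of the last formula finally leads to

$$\mathrm{vol}_n(K_n(r)) = V_n(r) = \frac{\pi^{n/2}}{\Gamma\big( \frac{n}{2} + 1 \big)}\, r^n = \frac{2\, \pi^{n/2}}{n\, \Gamma\big( \frac{n}{2} \big)}\, r^n, \quad r > 0 .$$

If we distinguish between even and odd dimensions, properties of the $\Gamma$-function
imply

$$V_{2k}(r) = \frac{\pi^k}{k!}\, r^{2k} \quad\text{and}\quad V_{2k+1}(r) = \frac{2^{k+1} \pi^k}{(2k+1)!!}\, r^{2k+1} ,$$

where $(2k+1)!! = 1 \cdot 3 \cdot 5 \cdots (2k-1)(2k+1)$.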
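
The closed formula for the volume of the unit ball and the recursion used to derive it can be compared numerically. The following sketch relies on `math.gamma`; the function names are illustrative.

```python
import math

def unit_ball_volume(n):
    """Closed form V_n = pi^(n/2) / Gamma(n/2 + 1)."""
    return math.pi ** (n / 2) / math.gamma(n / 2 + 1)

def unit_ball_volume_recursive(n):
    """Recursion V_m = sqrt(pi) * V_(m-1) * Gamma((m+1)/2) / Gamma(m/2 + 1), starting at V_1 = 2."""
    v = 2.0
    for m in range(2, n + 1):
        v *= math.sqrt(math.pi) * math.gamma((m + 1) / 2) / math.gamma(m / 2 + 1)
    return v

def ball_volume(n, r):
    """V_n(r) = V_n * r^n."""
    return unit_ball_volume(n) * r ** n

for n in range(1, 7):
    print(n, unit_ball_volume(n), unit_ball_volume_recursive(n))
```

For $n = 2$ and $n = 3$ this returns the familiar values $\pi$ and $4\pi/3$.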


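After the question about the volume is settled we are now in the position to introduce
the uniform distribution on bounded Borel sets in $\mathbb{R}^n$. Thus let $K \subseteq \mathbb{R}^n$ be a bounded
Borel set in $\mathbb{R}^n$ with volume $\mathrm{vol}_n(K)$. Define $p : \mathbb{R}^n \to \mathbb{R}$ by

$$p(x) := \begin{cases} \dfrac{1}{\mathrm{vol}_n(K)} & : x \in K \\ 0 & : x \notin K . \end{cases} \tag{1.70}$$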


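Proposition 1.8.9. The function $p$ defined by eq. (1.70) is an ($n$-dimensional) probability
density function.


Proof: By virtue of eq. (1.69) it follows that

$$\int_{\mathbb{R}^n} p(x)\,dx = \int_K \frac{1}{\mathrm{vol}_n(K)}\,dx = \frac{1}{\mathrm{vol}_n(K)} \int \cdots \int_K 1\,dx_n \cdots dx_1 = \frac{\mathrm{vol}_n(K)}{\mathrm{vol}_n(K)} = 1 .$$

Since $p(x) \ge 0$ if $x \in \mathbb{R}^n$, as asserted, $p$ is a probability density function. □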


Definition 1.8.10. The probability measure $P$ on $(\mathbb{R}^n, \mathcal{B}(\mathbb{R}^n))$ with density $p$ given
by eq. (1.70) is said to be the (multivariate) uniform distribution on $K$.


Let $P$ be the uniform distribution on $K$. How do we get $P(B)$ for a Borel set $B$? Let us
first assume $B \subseteq K$. Then

$$P(B) = \int_B p(x)\,dx = \frac{1}{\mathrm{vol}_n(K)} \int_B 1\,dx = \frac{\mathrm{vol}_n(B)}{\mathrm{vol}_n(K)} .$$

If $B \subseteq \mathbb{R}^n$ is arbitrary, that is, $B$ is not necessarily a subset of $K$, by $P(B) = P(B \cap K)$ it
follows that¹⁹

$$P(B) = \frac{\mathrm{vol}_n(B \cap K)}{\mathrm{vol}_n(K)} .$$

If $n = 1$ and $K$ is an interval, the last formula coincides with eq. (1.45).


19 This is an alternative way to introduce the uniform distribution on K.


Example 1.8.11. Two friends agree to meet each other in a restaurant between 1 and
2 pm. Both friends go to the restaurant randomly during this hour. After they arrive
they wait 20 minutes each. What is the probability that they meet each other?

Answer: Let $t_1$ be the moment where the first of the two friends enters the restaurant,
while $t_2$ is the arrival time of the second one. They arrive independently of each
other, thus we may assume that the point $t := (t_1, t_2)$ is uniformly distributed in the
square $Q := [1,2]^2$. Observing that 20 minutes are a third of an hour, they meet each
other if and only if $|t_1 - t_2| \le 1/3$.

Setting $B := \{(t_1, t_2) \in \mathbb{R}^2 : |t_1 - t_2| \le 1/3\}$, it is easy to see that $\mathrm{vol}_2(B \cap Q) = 5/9$.
Hence, if $P$ is the uniform distribution on $Q$, because of $\mathrm{vol}_2(Q) = 1$ it follows that $P(B) =
5/9$. Therefore, the probability that the friends meet each other equals $5/9$.
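
One way to see the value $5/9$: the part of the square $Q$ outside $B$ consists of two right triangles with legs of length $2/3$, so $\mathrm{vol}_2(B \cap Q) = 1 - (2/3)^2$. The sketch below evaluates this for a general waiting time and adds a coarse grid count as a sanity check; the function names and the grid size are illustrative assumptions.

```python
from fractions import Fraction

def meeting_probability(wait):
    """Area of {|t1 - t2| <= wait} inside a unit square, for 0 <= wait <= 1."""
    w = Fraction(wait)
    return 1 - (1 - w) ** 2    # unit square minus two right triangles with legs 1 - w

def meeting_probability_grid(wait, n=300):
    """Fraction of midpoints of an n x n grid on the unit square that satisfy |t1 - t2| <= wait."""
    hits = sum(1 for i in range(n) for j in range(n)
               if abs((i + 0.5) / n - (j + 0.5) / n) <= wait)
    return hits / (n * n)

print(meeting_probability(Fraction(1, 3)))   # exact value
print(meeting_probability_grid(1 / 3))       # grid approximation
```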


Example 1.8.12. Suppose $n$ particles are uniformly distributed in a ball $K_R$ of radius
$R > 0$. Let $K_r$ be a smaller ball of radius $r > 0$ contained in $K_R$. Find the probability that
exactly $k$ of the $n$ particles are inside $K_r$ for some $k = 0, \dots, n$.

Answer: In a first step we determine the probability that a single particle is in $K_r$.
Since we assumed that the particles are uniformly distributed in $K_R$, this probability
equals

$$p := \frac{\mathrm{vol}_3(K_r)}{\mathrm{vol}_3(K_R)} = \frac{(4/3)\pi r^3}{(4/3)\pi R^3} = \Big( \frac{r}{R} \Big)^3 .$$

For each of the $n$ particles this $p$ is the “success” probability to be inside $K_r$, hence the
number of particles in $K_r$ is $B_{n,p}$-distributed with $p = (r/R)^3$. Thus,

$$P\{k \text{ particles in } K_r\} = B_{n,p}(\{k\}) = \binom{n}{k} \Big( \frac{r}{R} \Big)^{3k} \Big( \frac{R^3 - r^3}{R^3} \Big)^{n-k}, \quad k = 0, \dots, n .$$

If the number $n$ of particles is big and $r$ is much smaller than $R$, then the number of
particles in $K_r$ is approximately $\mathrm{Pois}_\lambda$-distributed, where $\lambda = np = \frac{n r^3}{R^3}$. In other words,

$$P\{k \text{ particles in } K_r\} \approx \frac{1}{k!} \Big( \frac{n r^3}{R^3} \Big)^k e^{-n r^3 / R^3} .$$
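
The quality of the Poisson approximation in Example 1.8.12 can be inspected for concrete numbers. The values $n = 10000$, $r = 0.1$, $R = 1$ below are assumptions chosen for illustration (so that $\lambda = np = 10$), as are the function names.

```python
import math

def binomial_pmf(n, p, k):
    """B_{n,p}({k}) = C(n, k) * p^k * (1 - p)^(n - k)."""
    return math.comb(n, k) * p ** k * (1 - p) ** (n - k)

def poisson_pmf(lam, k):
    """Pois_lambda({k}) = lam^k / k! * exp(-lam)."""
    return lam ** k / math.factorial(k) * math.exp(-lam)

def particles_in_small_ball(n, r, R, k):
    """Probability that exactly k of n uniformly distributed particles lie in the small ball."""
    return binomial_pmf(n, (r / R) ** 3, k)

n, r, R = 10_000, 0.1, 1.0
lam = n * (r / R) ** 3
for k in (0, 5, 10, 15, 20):
    print(k, particles_in_small_ball(n, r, R, k), poisson_pmf(lam, k))
```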
Example 1.8.13 (Buffon’s needle test). Take a needle of length $a < 1$ and throw it
randomly on a lined sheet of paper. Say the distance between two lines on the paper
is 1. Find the probability that the needle cuts a line.

Answer: What is random in this experiment? Choose the two lines such that
between them the midpoint of the needle lies. Let $x \in [0,1]$ be the distance of the
midpoint of the needle to the lower line. Furthermore, denote by $\theta \in [-\pi/2, \pi/2]$ the
angle of the needle to a line perpendicular to the lines on the paper. For example, if
$\theta = 0$, then the needle is perpendicular to the lines on the paper while for $\theta = \pm\pi/2$ it
lies parallel.

Hence, to throw a needle randomly is equivalent to choosing a point $(\theta, x)$
uniformly distributed in $K = [-\pi/2, \pi/2] \times [0,1]$.

The needle cuts the lower line if and only if $\frac{a}{2} \cos\theta \ge x$ and it cuts the upper line
provided that $\frac{a}{2} \cos\theta \ge 1 - x$.

If

$$A = \Big\{ (\theta, x) \in [-\pi/2, \pi/2] \times [0,1] : x \le \frac{a}{2} \cos\theta \ \text{ or } \ 1 - x \le \frac{a}{2} \cos\theta \Big\} ,$$

then we get

$$P\{\text{The needle cuts a line}\} = P(A) = \frac{\mathrm{vol}_2(A)}{\mathrm{vol}_2(K)} = \frac{\mathrm{vol}_2(A)}{\pi} .$$

But it follows that

$$\mathrm{vol}_2(A) = 2 \int_{-\pi/2}^{\pi/2} \frac{a}{2} \cos\theta\,d\theta = 2a ,$$

hence

$$P(A) = \frac{2a}{\pi} .$$


Remark 1.8.14. Suppose we throw the same needle $n$ times. Let $r_n$ be the relative
frequency of the occurrence of $A$, that is,

$$r_n := \frac{\text{Number of throws where the needle cuts a line}}{n} .$$

As mentioned in Section 1.1.3, if $n \to \infty$, then $r_n$ approaches $P(A) = \frac{2a}{\pi}$. Thus for large $n$
we have $r_n \approx \frac{2a}{\pi}$ or, equivalently, $\pi \approx \frac{2a}{r_n}$. Consequently, throwing the needle sufficiently
often, $\frac{2a}{r_n}$ should be close to $\pi$.
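
The experiment of Remark 1.8.14 is easy to imitate on a computer. The sketch below draws $(\theta, x)$ uniformly from $K = [-\pi/2, \pi/2] \times [0,1]$ with Python's pseudo-random generator and returns $2a/r_n$; the seed, the number of throws and the function names are assumptions made here, and the result is of course only a random approximation of $\pi$.

```python
import math
import random

def needle_cuts_line(a, theta, x):
    """True if a needle of length a with midpoint height x and angle theta cuts a line."""
    half_span = 0.5 * a * math.cos(theta)
    return x <= half_span or 1.0 - x <= half_span

def estimate_pi(a, throws, seed=0):
    """Estimate pi by 2a / r_n, where r_n is the relative frequency of cuts."""
    rng = random.Random(seed)
    hits = sum(
        needle_cuts_line(a, rng.uniform(-math.pi / 2, math.pi / 2), rng.random())
        for _ in range(throws)
    )
    r_n = hits / throws
    return 2.0 * a / r_n

estimate = estimate_pi(a=0.8, throws=200_000)
print(estimate)
```

The convergence is slow: the error of the estimate decreases only of order $1/\sqrt{n}$, so even a few hundred thousand throws typically give no more than two correct decimal places.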
1.9 *Products of Probability Spaces
1.9.1 Product σ-Fields and Measures


Suppose we execute $n$ (maybe different) random experiments so that the outcomes
do not depend on each other. In order to describe these $n$ experiments two different
approaches are possible. Firstly, we record each single result separately, that is, we
have $n$ (maybe different) probability spaces $(\Omega_1, \mathcal{A}_1, P_1)$ to $(\Omega_n, \mathcal{A}_n, P_n)$ modeling the
outcomes of the first up to the $n$th experiment.

A second possible approach is that we combine the $n$ experiments into a single
one. Thus, instead of $n$ different outcomes $\omega_1$ to $\omega_n$, we observe now a vector $\omega =
(\omega_1, \dots, \omega_n)$. The sample space in this approach is given by $\Omega = \Omega_1 \times \cdots \times \Omega_n$.


Example 1.9.1. When rolling a die $n$ times the outcome is a series of $n$ numbers $\omega_1$ to
$\omega_n$, each in $\{1, \dots, 6\}$. Now, imagine we have a die with $6^n$ equally likely faces. On
these faces, all possible sequences of length $n$ with entries from $\{1, \dots, 6\}$ are written.
Roll this die once. The first experiment may be described by $n$ probability spaces, one
for each roll. The second experiment involves only one probability space. Nevertheless,
both experiments lead to the same result, a random sequence of numbers from 1
to 6.


It is intuitively clear that both approaches to this experiment (rolling a die $n$ times) are
equivalent; they differ only by the point of view. But how to come from one model
to the other? One direction is immediately clear. If the random result is a vector
$\omega = (\omega_1, \dots, \omega_n)$, then its coordinates may be taken as the results of the single
experiments²⁰. But how about the other direction? That is, we are given $n$ probability spaces
$(\Omega_1, \mathcal{A}_1, P_1), \dots, (\Omega_n, \mathcal{A}_n, P_n)$ and have to construct a model for the joint execution of
these experiments.
Of course, the “new” sample space is

$$\Omega = \Omega_1 \times \cdots \times \Omega_n , \tag{1.71}$$

but what are $\mathcal{A}$ and $P$? We start with the construction of the product $\sigma$-field.


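Definition 1.9.2. Let $\mathcal{A}_j$ be $\sigma$-fields on $\Omega_j$, $1 \le j \le n$. Set $\Omega = \Omega_1 \times \cdots \times \Omega_n$. Then

$$\mathcal{A} := \sigma\{A_1 \times \cdots \times A_n : A_j \in \mathcal{A}_j\}$$

is called the product $\sigma$-field of $\mathcal{A}_1$ to $\mathcal{A}_n$. It is denoted by $\mathcal{A} = \mathcal{A}_1 \otimes \cdots \otimes \mathcal{A}_n$.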


20 Of course, one still has to verify that the distribution of the coordinates is the same as in the single
experiments. But before we can do this we need a probability measure describing the distribution of
the vectors (cf. Proposition 1.9.8).

Remark 1.9.3. In other words, $\mathcal{A}$ is the smallest $\sigma$-field containing measurable rectangle
sets, that is, sets of the form $A_1 \times \cdots \times A_n$ with $A_j \in \mathcal{A}_j$, $1 \le j \le n$.


It is easy to see that $\mathcal{P}(\Omega_1) \otimes \cdots \otimes \mathcal{P}(\Omega_n) = \mathcal{P}(\Omega)$. A more complicated example is as
follows.


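Proposition 1.9.4. Suppose $\Omega_1 = \cdots = \Omega_n = \mathbb{R}$, hence $\Omega = \mathbb{R}^n$. Then the $\sigma$-field $\mathcal{B}(\mathbb{R}^n)$
of Borel sets in $\mathbb{R}^n$ is the $n$-fold product of the $\sigma$-fields $\mathcal{B}(\mathbb{R})$ of Borel sets in $\mathbb{R}$, that is,

$$\mathcal{B}(\mathbb{R}^n) = \underbrace{\mathcal{B}(\mathbb{R}) \otimes \cdots \otimes \mathcal{B}(\mathbb{R})}_{n \text{ times}} .$$


Proof: We only give a sketch of the proof. Let $Q$ be a box as in eq. (1.65). Then $Q =
A_1 \times \cdots \times A_n$, where the $A_j$s are intervals, hence in $\mathcal{B}(\mathbb{R})$. By the construction of the
product $\sigma$-field it follows that $Q \in \mathcal{B}(\mathbb{R}) \otimes \cdots \otimes \mathcal{B}(\mathbb{R})$. But $\mathcal{B}(\mathbb{R}^n)$ is the smallest $\sigma$-field
containing all boxes, which lets us conclude

$$\mathcal{B}(\mathbb{R}^n) \subseteq \mathcal{B}(\mathbb{R}) \otimes \cdots \otimes \mathcal{B}(\mathbb{R}) .$$

The inclusion in the other direction may be proved as follows: fix $a_2 < b_2$ to $a_n < b_n$
and let

$$\mathcal{C}_1 := \{C \in \mathcal{B}(\mathbb{R}) : C \times [a_2, b_2] \times \cdots \times [a_n, b_n] \in \mathcal{B}(\mathbb{R}^n)\} .$$

It is not difficult to prove that $\mathcal{C}_1$ is a $\sigma$-field. If $C = [a_1, b_1]$, then $C \times [a_2, b_2] \times \cdots \times [a_n, b_n]$
is a box, thus in $\mathcal{B}(\mathbb{R}^n)$. Consequently, $\mathcal{C}_1$ contains closed intervals, hence, since $\mathcal{B}(\mathbb{R})$
is the smallest $\sigma$-field with this property, it follows that $\mathcal{C}_1 = \mathcal{B}(\mathbb{R})$. This tells us that for all
$B_1 \in \mathcal{B}(\mathbb{R})$ and all $a_j < b_j$

$$B_1 \times [a_2, b_2] \times \cdots \times [a_n, b_n] \in \mathcal{B}(\mathbb{R}^n) .$$

In a next step fix $B_1 \in \mathcal{B}(\mathbb{R})$ and $a_3 < b_3$ up to $a_n < b_n$, and set

$$\mathcal{C}_2 := \{C \in \mathcal{B}(\mathbb{R}) : B_1 \times C \times [a_3, b_3] \times \cdots \times [a_n, b_n] \in \mathcal{B}(\mathbb{R}^n)\} .$$

By the same arguments as before, but now using the first step, we get $\mathcal{C}_2 = \mathcal{B}(\mathbb{R})$,
that is,

$$B_1 \times B_2 \times [a_3, b_3] \times \cdots \times [a_n, b_n] \in \mathcal{B}(\mathbb{R}^n)$$

for all $B_1, B_2 \in \mathcal{B}(\mathbb{R})$ and $a_j < b_j$.
Iterating further we finally obtain that for all $B_j \in \mathcal{B}(\mathbb{R})$ it follows that

$$B_1 \times \cdots \times B_n \in \mathcal{B}(\mathbb{R}^n) .$$

Since $\mathcal{B}(\mathbb{R}) \otimes \cdots \otimes \mathcal{B}(\mathbb{R}) = \sigma\{B_1 \times \cdots \times B_n : B_j \in \mathcal{B}(\mathbb{R})\}$ is the smallest $\sigma$-field containing
sets $B_1 \times \cdots \times B_n$, this implies

$$\mathcal{B}(\mathbb{R}) \otimes \cdots \otimes \mathcal{B}(\mathbb{R}) \subseteq \mathcal{B}(\mathbb{R}^n)$$

and completes the proof. □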


Let us now turn to the probability measure P on (Q, A) that describes the combined 
experiment. 


Definition 1.9.5. Let $(\Omega_1, \mathcal{A}_1, P_1)$ to $(\Omega_n, \mathcal{A}_n, P_n)$ be $n$ probability spaces. Define $\Omega$
by eq. (1.71) and endow it with the product $\sigma$-field $\mathcal{A} = \mathcal{A}_1 \otimes \cdots \otimes \mathcal{A}_n$. A probability
measure $P$ on $(\Omega, \mathcal{A})$ is called the product measure of $P_1, \dots, P_n$ if

$$P(A_1 \times \cdots \times A_n) = P_1(A_1) \cdots P_n(A_n) \quad\text{for all } A_j \in \mathcal{A}_j . \tag{1.72}$$

We write $P = P_1 \otimes \cdots \otimes P_n$ and if $P_1 = \cdots = P_n = P_0$ set

$$P_0^{\otimes n} := \underbrace{P_0 \otimes \cdots \otimes P_0}_{n \text{ times}} .$$


It is not clear at all whether product measures exist, and if this is so, whether
condition (1.72) determines them uniquely. The next result shows that the answer to both
questions is affirmative. Unfortunately, the proof is too complicated to be presented
here. The idea is quite similar to that used in the introduction of volumes in eq. (1.68).
The boxes appearing there have to be replaced by rectangle sets $A_1 \times \cdots \times A_n$ with $A_j \in \mathcal{A}_j$
and the volume of the boxes by $P_1(A_1) \cdots P_n(A_n)$. We refer to [Dur10], Section 1.7, or
[Coh13], for a detailed proof for the existence (and uniqueness) of product measures.


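Proposition 1.9.6. Let $(\Omega_1, \mathcal{A}_1, \mathbb{P}_1), \ldots, (\Omega_n, \mathcal{A}_n, \mathbb{P}_n)$ be probability spaces. Define $\Omega$ by
eq. (1.71) and let $\mathcal{A}$ be the product $\sigma$-field of the $\mathcal{A}_j$. Then there is a unique probability
measure $\mathbb{P}$ on $(\Omega, \mathcal{A})$ satisfying eq. (1.72). Hence, the product measure $\mathbb{P} = \mathbb{P}_1 \otimes \cdots \otimes \mathbb{P}_n$
always exists and is uniquely determined by eq. (1.72).


Corollary 1.9.7. Let $\mathbb{P}_1, \ldots, \mathbb{P}_n$ be probability measures on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$. Then there is a
unique probability measure $\mathbb{P}$ on $(\mathbb{R}^n, \mathcal{B}(\mathbb{R}^n))$ such that

$$\mathbb{P}(B_1 \times \cdots \times B_n) = \mathbb{P}_1(B_1) \cdots \mathbb{P}_n(B_n) \quad \text{for all } B_j \in \mathcal{B}(\mathbb{R}).$$

Proof: The proof is a direct consequence of Propositions 1.9.6 and 1.9.4. Indeed, take
$\mathbb{P} = \mathbb{P}_1 \otimes \cdots \otimes \mathbb{P}_n$ and observe that $\mathcal{B}(\mathbb{R}^n) = \mathcal{B}(\mathbb{R}) \otimes \cdots \otimes \mathcal{B}(\mathbb{R})$. □

Let us briefly come back to the question asked at the beginning of this section.
Suppose we observe a vector $\omega = (\omega_1, \ldots, \omega_n)$. How are the coordinates distributed?


Proposition 1.9.8. Let $(\Omega, \mathcal{A}, \mathbb{P})$ be the product probability space of $(\Omega_1, \mathcal{A}_1, \mathbb{P}_1)$ to
$(\Omega_n, \mathcal{A}_n, \mathbb{P}_n)$. If $j \le n$ and $A \in \mathcal{A}_j$, then

$$\mathbb{P}\{(\omega_1, \ldots, \omega_n) \in \Omega : \omega_j \in A\} = \mathbb{P}_j(A).$$

Proof: Observe that

$$\{(\omega_1, \ldots, \omega_n) \in \Omega : \omega_j \in A\} = \Omega_1 \times \cdots \times \Omega_{j-1} \times A \times \Omega_{j+1} \times \cdots \times \Omega_n,$$

thus eq. (1.72) implies

$$\mathbb{P}\{(\omega_1, \ldots, \omega_n) \in \Omega : \omega_j \in A\} = \mathbb{P}_1(\Omega_1) \cdots \mathbb{P}_{j-1}(\Omega_{j-1}) \cdot \mathbb{P}_j(A) \cdot \mathbb{P}_{j+1}(\Omega_{j+1}) \cdots \mathbb{P}_n(\Omega_n) = \mathbb{P}_j(A)$$

as asserted. □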


How do we get product measures in concrete cases? We answer this question for
discrete and continuous probability measures separately.


1.9.2 Product Measures: Discrete Case

Let $\Omega_1$ to $\Omega_n$ be either finite or countably infinite sets. Given probability measures $\mathbb{P}_j$
defined on $\mathcal{P}(\Omega_j)$, $1 \le j \le n$, the following result characterizes the product measure of
the $\mathbb{P}_j$.


Proposition 1.9.9. $\mathbb{P}$ is the product measure of $\mathbb{P}_1, \ldots, \mathbb{P}_n$ if and only if

$$\mathbb{P}(\{\omega\}) = \mathbb{P}_1(\{\omega_1\}) \cdots \mathbb{P}_n(\{\omega_n\}) \quad \text{for all } \omega = (\omega_1, \ldots, \omega_n) \in \Omega. \qquad (1.73)$$


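Proof: One direction is easy. Indeed, if $\mathbb{P} = \mathbb{P}_1 \otimes \cdots \otimes \mathbb{P}_n$, given $\omega = (\omega_1, \ldots, \omega_n) \in \Omega$
set $A = \{\omega\}$ and $A_j = \{\omega_j\}$. Then $A = A_1 \times \cdots \times A_n$, hence

$$\mathbb{P}(\{\omega\}) = \mathbb{P}(A) = \mathbb{P}_1(A_1) \cdots \mathbb{P}_n(A_n) = \mathbb{P}_1(\{\omega_1\}) \cdots \mathbb{P}_n(\{\omega_n\}),$$

proving eq. (1.73).

To verify the other implication let $\mathbb{P}$ be a probability measure on $(\Omega, \mathcal{P}(\Omega))$ satisfying
(1.73). We have to show that $\mathbb{P}$ fulfills eq. (1.72). Thus choose arbitrary $A_j \subseteq \Omega_j$ and
set $A = A_1 \times \cdots \times A_n$. By applying eq. (1.73) it follows

$$\mathbb{P}(A) = \sum_{\omega \in A} \mathbb{P}(\{\omega\}) = \sum_{(\omega_1, \ldots, \omega_n) \in A} \mathbb{P}(\{(\omega_1, \ldots, \omega_n)\})$$

$$= \sum_{\omega_1 \in A_1, \ldots, \omega_n \in A_n} \mathbb{P}_1(\{\omega_1\}) \cdots \mathbb{P}_n(\{\omega_n\})$$

$$= \sum_{\omega_1 \in A_1} \mathbb{P}_1(\{\omega_1\}) \cdots \sum_{\omega_n \in A_n} \mathbb{P}_n(\{\omega_n\}) = \mathbb{P}_1(A_1) \cdots \mathbb{P}_n(A_n).$$

This being true for all $A_j \subseteq \Omega_j$ shows that $\mathbb{P} = \mathbb{P}_1 \otimes \cdots \otimes \mathbb{P}_n$, and the proof is
complete. □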


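Summary: In the discrete case the product measure is characterized as follows. Given
$A \subseteq \Omega$, then

$$(\mathbb{P}_1 \otimes \cdots \otimes \mathbb{P}_n)(A) = \sum_{(\omega_1, \ldots, \omega_n) \in A} \mathbb{P}_1(\{\omega_1\}) \cdots \mathbb{P}_n(\{\omega_n\}).$$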
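
The summary formula can be tried out on a small finite example. The following Python sketch is only an illustration: the function names and the chosen spaces (a fair coin and a fair die) are assumptions made here, not part of the text. It builds the point masses of the product measure via eq. (1.73) and checks eq. (1.72) for one rectangle set.

```python
from fractions import Fraction
from itertools import product


def product_measure(marginals):
    """Point masses of P_1 x ... x P_n, computed via eq. (1.73).

    Each marginal is a dict mapping outcomes to their probabilities.
    """
    point_mass = {}
    for outcome in product(*(m.keys() for m in marginals)):
        weight = Fraction(1)
        for coordinate, marginal in zip(outcome, marginals):
            weight *= marginal[coordinate]
        point_mass[outcome] = weight
    return point_mass


def measure(point_mass, event):
    """Probability of an event A, i.e. the sum of the point masses over A."""
    return sum((point_mass[w] for w in event), Fraction(0))


# Hypothetical marginals: a fair coin and a fair die.
coin = {0: Fraction(1, 2), 1: Fraction(1, 2)}
die = {k: Fraction(1, 6) for k in range(1, 7)}
P = product_measure([coin, die])

# Check eq. (1.72) on the rectangle A_1 x A_2 with A_1 = {1}, A_2 = {5, 6}.
A1, A2 = {1}, {5, 6}
lhs = measure(P, set(product(A1, A2)))
rhs = sum(coin[a] for a in A1) * sum(die[b] for b in A2)
print(lhs, rhs)
```

Exact fractions are used so that both sides of eq. (1.72) can be compared without rounding errors.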


Example 1.9.10. Suppose two players, say U and V, each toss simultaneously a biased
coin. On both coins “0” (failure) appears with probability $1 - p$ and “1” (success)
with probability $p$. The pair $(k, l) \in \mathbb{N}^2$ occurs if player U has his first success in
trial k and player V in trial l. Each single experiment is described by the geometric
distribution $G_p$, hence the model for the combined experiment is $(\mathbb{N}^2, \mathcal{P}(\mathbb{N}^2), G_p^{\otimes 2})$.
Here

$$G_p^{\otimes 2}(A) = \sum_{(k,l) \in A} G_p(\{k\})\, G_p(\{l\}) = \sum_{(k,l) \in A} p^2 (1-p)^{k+l-2}, \quad A \subseteq \mathbb{N}^2.$$

For example, if $A = \{(k, k) : k \ge 1\}$, then

$$G_p^{\otimes 2}(A) = p^2 \sum_{k=1}^{\infty} (1-p)^{2k-2} = \frac{p^2}{1 - (1-p)^2} = \frac{p}{2-p}.$$

Thus in the case of a fair coin, the probability that both players have their first success
at the same time equals 1/3.
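
As a plausibility check of the closed form p/(2 - p), one may sum the series numerically. The sketch below is an illustration only; the function name and the truncation after 2000 terms are assumptions made here.

```python
def same_trial_probability(p, terms=2000):
    """Partial sum of sum_{k>=1} p^2 (1-p)^(2k-2), truncated after `terms` terms."""
    return sum(p**2 * (1 - p) ** (2 * k - 2) for k in range(1, terms + 1))


p = 0.5  # fair coin
approx = same_trial_probability(p)
closed_form = p / (2 - p)
print(approx, closed_form)
```

For a fair coin both numbers agree with 1/3 up to rounding.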


Example 1.9.11. Toss a biased coin n times. Say the coin is labeled with “0” and “1”
and $p \in [0, 1]$ is the probability of the occurrence of “1.” Recording each single result
separately, the describing probability spaces are $(\{0, 1\}, \mathcal{P}(\{0, 1\}), \mathbb{P}_j)$, $1 \le j \le n$, with
$\mathbb{P}_j(\{1\}) = p$. Which probability space describes the combined result?

Answer: Of course, the sample space is $\Omega = \{0, 1\}^n$ with $\sigma$-field $\mathcal{P}(\Omega)$. Let
$\omega = (\omega_1, \ldots, \omega_n)$ be an arbitrary vector in $\Omega$. Then by Proposition 1.9.9 the product measure
$\mathbb{P}$ of the $\mathbb{P}_j$ is characterized by

$$\mathbb{P}(\{\omega\}) = \mathbb{P}_1(\{\omega_1\}) \cdots \mathbb{P}_n(\{\omega_n\}) = p^k (1-p)^{n-k},$$

where $k = \#\{j \le n : \omega_j = 1\} = \sum_{j=1}^{n} \omega_j$.

For example, tossing the coin five times, the sequence (0, 0, 1, 1, 0) occurs with
probability $p^2 (1-p)^3$.

1.9.3 Product Measures: Continuous Case 
Here we assume $\Omega_1 = \cdots = \Omega_n = \mathbb{R}$, hence the product sample space is $\Omega = \mathbb{R}^n$.
Furthermore, each $\Omega_j = \mathbb{R}$ is endowed with the Borel $\sigma$-field. Because of Proposition
1.9.4 the product $\sigma$-field on $\Omega = \mathbb{R}^n$ is given by $\mathcal{B}(\mathbb{R}^n)$.


The next proposition characterizes the product measure of continuous probability
measures.


Proposition 1.9.12. Let $\mathbb{P}_1, \ldots, \mathbb{P}_n$ be probability measures on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$ with respective
density functions $p_1, \ldots, p_n$, that is,

$$\mathbb{P}_j([a, b]) = \int_a^b p_j(x)\, dx, \quad 1 \le j \le n.$$

Define $p : \mathbb{R}^n \to [0, \infty)$ by

$$p(x) = p_1(x_1) \cdots p_n(x_n), \quad x = (x_1, \ldots, x_n) \in \mathbb{R}^n. \qquad (1.74)$$

The product measure $\mathbb{P}_1 \otimes \cdots \otimes \mathbb{P}_n$ is continuous with (n-dimensional) density p defined
by (1.74). In other words, for each Borel set $A \subseteq \mathbb{R}^n$ we have

$$(\mathbb{P}_1 \otimes \cdots \otimes \mathbb{P}_n)(A) = \int \cdots \int_A p_1(x_1) \cdots p_n(x_n)\, dx_n \cdots dx_1 = \int_A p(x)\, dx.$$


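Proof: First note that p is a density of the product measure $\mathbb{P}_1 \otimes \cdots \otimes \mathbb{P}_n$ if

$$(\mathbb{P}_1 \otimes \cdots \otimes \mathbb{P}_n)(Q) = \int_Q p(x)\, dx$$

for all boxes $Q = [a_1, b_1] \times \cdots \times [a_n, b_n]$. But this is an immediate consequence of

$$\int_Q p(x)\, dx = \int_{a_1}^{b_1} \cdots \int_{a_n}^{b_n} p_1(x_1) \cdots p_n(x_n)\, dx_n \cdots dx_1$$

$$= \left( \int_{a_1}^{b_1} p_1(x_1)\, dx_1 \right) \cdots \left( \int_{a_n}^{b_n} p_n(x_n)\, dx_n \right) = \mathbb{P}_1([a_1, b_1]) \cdots \mathbb{P}_n([a_n, b_n])$$

$$= (\mathbb{P}_1 \otimes \cdots \otimes \mathbb{P}_n)([a_1, b_1] \times \cdots \times [a_n, b_n]) = (\mathbb{P}_1 \otimes \cdots \otimes \mathbb{P}_n)(Q).$$

This completes the proof. □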


Because of its importance let us explain through several examples how Proposition 
1.9.12 applies. Further applications, for example, the characterization of independent 
random variables, will follow in Sections 3 and 8. 


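Example 1.9.13. Let the probability measures $\mathbb{P}_j$, $1 \le j \le n$, be uniform distributions
on $[\alpha_j, \beta_j]$. Thus

$$p_j(s) = \begin{cases} \dfrac{1}{\beta_j - \alpha_j} & : \alpha_j \le s \le \beta_j \\ 0 & : \text{otherwise} \end{cases}$$

hence, if $x = (x_1, \ldots, x_n)$, then

$$p(x) = p_1(x_1) \cdots p_n(x_n) = \begin{cases} \dfrac{1}{\prod_{j=1}^{n} (\beta_j - \alpha_j)} & : x \in K \\ 0 & : \text{otherwise} \end{cases}$$

Here $K \subseteq \mathbb{R}^n$ is the box $[\alpha_1, \beta_1] \times \cdots \times [\alpha_n, \beta_n]$. Since $\prod_{j=1}^{n} (\beta_j - \alpha_j) = \mathrm{vol}_n(K)$, it follows
that the product measure $\mathbb{P}_1 \otimes \cdots \otimes \mathbb{P}_n$ is nothing else than the (n-dimensional) uniform
distribution on K as introduced in Definition 1.8.10 (see footnote 21).


Summary: The product measure of n uniform distributions on intervals $[\alpha_j, \beta_j]$ is the
uniform distribution on the box $[\alpha_1, \beta_1] \times \cdots \times [\alpha_n, \beta_n]$.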


21 This result was already used in Example 1.8.11. Indeed, the arrival times $t_1$ and $t_2$ were described
by the uniform distributions on [1, 2], thus the pair $t = (t_1, t_2)$ is distributed according to the product
measure, which is the uniform distribution on $[1, 2] \times [1, 2]$.


Example 1.9.14. Assume now $\mathbb{P}_1 = \cdots = \mathbb{P}_n = E_\lambda$, that is, we want to describe the
product of n exponential distributions with parameter $\lambda > 0$. Since $p_j(s) = \lambda e^{-\lambda s}$ if
$s \ge 0$ and $p_j(s) = 0$ if $s < 0$, their product $E_\lambda^{\otimes n}$ possesses the density

$$p(s_1, \ldots, s_n) = \begin{cases} \lambda^n e^{-\lambda (s_1 + \cdots + s_n)} & : s_1, \ldots, s_n \ge 0 \\ 0 & : \text{otherwise} \end{cases}$$

Which random experiment does $E_\lambda^{\otimes n}$ describe? Suppose we have n light bulbs of the
same type with lifetime distributed according to $E_\lambda$. Switch on all n bulbs at once and
record the times $t_1, \ldots, t_n$ where the first bulb, the second, and so on burns out. If
$t = (t_1, \ldots, t_n) \in \mathbb{R}^n$ denotes the generated vector of these times, then for Borel sets
$A \subseteq [0, \infty)^n$,

$$\mathbb{P}\{t \in A\} = E_\lambda^{\otimes n}(A) = \lambda^n \int_A e^{-\lambda (s_1 + \cdots + s_n)}\, ds_n \cdots ds_1.$$

For example, if we want to compute the probability for

$$A := \{(t_1, \ldots, t_n) : 0 \le t_1 \le \cdots \le t_n\},$$

that is, the second bulb burns longer than the first one, the third longer than the
second, and so on, then

$$E_\lambda^{\otimes n}(A) = \lambda^n \int_0^{\infty} e^{-\lambda s_n} \int_0^{s_n} e^{-\lambda s_{n-1}} \int_0^{s_{n-1}} \cdots \int_0^{s_3} e^{-\lambda s_2} \int_0^{s_2} e^{-\lambda s_1}\, ds_1 \cdots ds_n.$$

Iterative integration leads to $E_\lambda^{\otimes n}(A) = 1/n!$. This is more or less obvious by the following
observation. Each order of the times of failure is equally likely. And since there are
n! different ways to order these times, each order has probability 1/n!. In particular,
this is true for the order $t_1 \le \cdots \le t_n$.
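
The value 1/n! can also be made plausible by simulation. The following Python sketch is a rough Monte Carlo check, not part of the text: the function name, the number of trials, the seed, and the choice n = 3 are assumptions made for illustration. It draws n independent exponential lifetimes and counts how often they come out in increasing order.

```python
import math
import random


def estimate_ordered_probability(n, rate=1.0, trials=200_000, seed=0):
    """Estimate the probability that n independent Exp(rate) lifetimes
    satisfy t_1 <= t_2 <= ... <= t_n."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(trials):
        times = [rng.expovariate(rate) for _ in range(n)]
        if all(times[i] <= times[i + 1] for i in range(n - 1)):
            hits += 1
    return hits / trials


n = 3
estimate = estimate_ordered_probability(n)
exact = 1 / math.factorial(n)
print(estimate, exact)
```

The estimate should be close to 1/6, independently of the chosen rate, which reflects the symmetry argument given above.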


Next we give another example of a product measure that will play a crucial role in
Sections 6 and 8.


Example 1.9.15. Let $\mathbb{P}_1, \ldots, \mathbb{P}_n$ be standard normal distributions. The corresponding
densities are

$$p_j(x_j) = \frac{1}{\sqrt{2\pi}}\, e^{-x_j^2/2}, \quad 1 \le j \le n.$$

Thus, by eq. (1.74) the density p of their product $\mathcal{N}(0, 1)^{\otimes n}$ coincides with

$$p(x) = \frac{1}{(2\pi)^{n/2}}\, e^{-\sum_{j=1}^{n} x_j^2/2} = \frac{1}{(2\pi)^{n/2}}\, e^{-|x|^2/2},$$

where $|x| = \left( \sum_{j=1}^{n} x_j^2 \right)^{1/2}$ denotes the Euclidean distance of the vector x to 0 (compare
Section A.4).


Definition 1.9.16. The probability measure $\mathcal{N}(0, 1)^{\otimes n}$ on $\mathcal{B}(\mathbb{R}^n)$ is called the
n-dimensional or multivariate standard normal distribution. It is described by

$$\mathcal{N}(0, 1)^{\otimes n}(B) = \frac{1}{(2\pi)^{n/2}} \int_B e^{-|x|^2/2}\, dx, \quad B \in \mathcal{B}(\mathbb{R}^n).$$

Figure 1.10: The density of the two-dimensional standard normal distribution.


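Example 1.9.17. Finally we describe the n-fold product measure of the normal distribution
$\mathcal{N}(\mu, \sigma^2)$ with $\mu \in \mathbb{R}$ and $\sigma^2 > 0$. The densities are

$$p_j(x_j) = \frac{1}{\sqrt{2\pi}\, \sigma}\, e^{-(x_j - \mu)^2/2\sigma^2}, \quad 1 \le j \le n,$$

hence, as in Example 1.9.15, setting $\vec{\mu} = (\mu, \ldots, \mu) \in \mathbb{R}^n$, the product $\mathcal{N}(\mu, \sigma^2)^{\otimes n}$
may be represented as

$$\mathcal{N}(\mu, \sigma^2)^{\otimes n}(B) = \frac{1}{(2\pi)^{n/2} \sigma^n} \int_B e^{-|x - \vec{\mu}|^2/2\sigma^2}\, dx, \quad B \in \mathcal{B}(\mathbb{R}^n). \qquad (1.75)$$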


1.10 Problems 


Problem 1.1. Let A, B, and C be three events in a sample space $\Omega$. Express the following
events in terms of these sets:

• Only A occurs.
• A and B occur, but C does not.
• At least one of the three events occurs.
• At least two of the events occur.
• At most one of the three events occurs.
• None of the events occurs.
• Exactly two of the events occur.
• Not more than two of the events occur.


Problem 1.2. Suppose an urn contains black and white balls. Successively one draws
n balls out of the urn. The event $A_j$ occurs if the ball drawn in the jth trial is white.
Hereby $1 \le j \le n$. Express the following events $B_1, \ldots, B_4$ in terms of the $A_j$:

$B_1$ = {All drawn balls are white}
$B_2$ = {At least one of the balls is white}
$B_3$ = {Exactly one of the drawn balls is white}
$B_4$ = {All n balls possess the same color}

Determine the cardinalities $\#(B_j)$, $j = 1, \ldots, 4$.


Problem 1.3. Let $\mathbb{P}$ be a probability measure on $(\Omega, \mathcal{A})$. Given $A, B \in \mathcal{A}$ show that

$$\mathbb{P}(A \Delta B) = \mathbb{P}(A) + \mathbb{P}(B) - 2\, \mathbb{P}(A \cap B).$$


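Problem 1.4. The events A and B possess the probabilities $\mathbb{P}(A) = 1/3$ and $\mathbb{P}(B) = 1/4$.
Moreover, we know that $\mathbb{P}(A \cap B) = 1/6$. Compute $\mathbb{P}(A^c)$, $\mathbb{P}(A^c \cup B)$, $\mathbb{P}(A \cup B^c)$, $\mathbb{P}(A \cap B^c)$,
$\mathbb{P}(A \Delta B)$, and $\mathbb{P}(A^c \cup B^c)$.


Problem 1.5 (Inclusion-exclusion formula). Let $(\Omega, \mathcal{A}, \mathbb{P})$ be a probability space and
let $A_1, \ldots, A_n \in \mathcal{A}$ be some (not necessarily disjoint) events. Prove that

$$\mathbb{P}\Big( \bigcup_{j=1}^{n} A_j \Big) = \sum_{k=1}^{n} (-1)^{k+1} \sum_{1 \le j_1 < \cdots < j_k \le n} \mathbb{P}(A_{j_1} \cap \cdots \cap A_{j_k}).$$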


Hint: One way to prove this is by induction over n, thereby using Proposition 1.2.3. 


Problem 1.6. Use Problem 1.5 to investigate the following question: The numbers
from 1 to n are ordered randomly. All orderings are equally likely. What is the probability
that there exists an integer $m \le n$ so that m is at position m of the ordering?
Determine the limit of this probability as $n \to \infty$.

Still another version of this problem. Suppose n persons attend a Christmas party. 
Each of the n participants brings a present with him. These presents are collected, 
mixed, and then randomly distributed among the guests. Compute the probability that 
at least one of the participants gets his own present. 


Problem 1.7. Suppose in an urn are N balls; k are white, l are red, and m are black.
Thus, $k + l + m = N$. Choose n balls out of the urn. Find a formula for the probability
that among the n chosen balls are those of all three colors. Investigate this problem if
1. the chosen ball is always replaced and
2. if $n \le N$ and the balls are not replaced.


Hint: If A is the event that all three colors appear then compute $\mathbb{P}(A^c)$. To this end write
$A^c = A_1 \cup A_2 \cup A_3$ with suitable $A_j$ and apply Proposition 1.2.4.


Problem 1.8. Suppose events A and B occur both with probability 1/2. Prove that then

$$\mathbb{P}(A \cup B) = \mathbb{P}(A^c \cup B^c). \qquad (1.76)$$

Does (1.76) remain valid assuming $\mathbb{P}(A) + \mathbb{P}(B) = 1$ instead of $\mathbb{P}(A) = \mathbb{P}(B) = 1/2$?


Problem 1.9. Three men and three women sit down randomly on six chairs in a row. 
Find the probability that the three men and the three women sit side by side. What is 
the probability that next to each woman sits a man (to the right or to the left)? 


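Problem 1.10. Let $(\Omega, \mathcal{A}, \mathbb{P})$ be a probability space. Prove the following: Whenever
events $A_1, A_2, \ldots$ in $\mathcal{A}$ satisfy $\mathbb{P}(A_1) = \mathbb{P}(A_2) = \cdots = 1$, then this implies

$$\mathbb{P}\Big( \bigcap_{j=1}^{\infty} A_j \Big) = 1.$$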


Problem 1.11 (Paradox of Chevalier de Méré). Chevalier de Méré mentioned that
when rolling three fair nondistinguishable dice there are 6 different possibilities for
obtaining 11 as the sum and likewise 6 for obtaining 12. Thus he concluded that both events
(sum equals 11 or sum equals 12) should be equally likely. But experiments showed that this is not
the case. Why was he wrong, and what are the correct probabilities for both events?


Problem 1.12. A man has forgotten an important phone number. He only remembers 
that the seven-digit number contained three times “1” and “4” and “6” twice each. 
He dials the seven numbers in random order. Find the probability that he dialed the 
correct one. 


Problem 1.13. In an urn are n black and m red balls. One draws successively all n + m
balls (without replacement). What is the probability that the ball chosen last is red?


Problem 1.14. A man has in his pocket n keys to open a door. Only one of the keys
fits. He tries the keys one after the other until he has chosen the correct one. Given an
integer k compute the probability that the correct key is the one chosen in the kth trial.
Evaluate this probability in each of the two following cases:


— The man always discards wrong keys. 
— The man does not discard them, that is, he puts back wrong keys. 


Problem 1.15 (Monty Hall problem). At the end of a quiz the winner has the choice 
between three doors, say A, B, and C. Behind two of the doors there is a goat, behind 
the third one a car. His prize is what is behind the chosen door. 

Say the winner has chosen door A. Then the quizmaster (who knows what is behind
each of the three doors) opens one of the two remaining doors (in our case either
door B or door C) and shows that there is a goat behind it. After that the quizmaster
asks the candidate whether or not he wants to revise his decision, that is, for example,
if B was opened, whether he wants to switch from A to C or to stay with door A.

Find the probabilities to win the car in both cases (switching or nonswitching). 


Problem 1.16. In a lecture room are N students. Evaluate the probability that at least
two of the students were born on the same day of the year (day and month of their
births are the same, but not necessarily the year). Here disregard leap years and
assume that all days of a year are equally likely. How big must N be in order that this
probability is greater than 1/2?


Problem 1.17. In an urn are balls labeled from 0 to 6 so that all numbers are equally 
likely. Choose successively and with replacement three balls. Find the probability that 
the three observed numbers sum up to 6. 


Problem 1.18. When sending messages from A to B on average 3% are transmitted
falsely. Suppose 300 messages are sent. What is the probability that at least three messages
are transmitted falsely? Evaluate the exact probability by using the binomial
distribution as well as the approximate probability by using the Poisson distribution.
Compute the probability (both exact and approximate) that all messages arrive
correctly.


Problem 1.19. The number of accidents in a city per week is assumed to be Poisson
distributed with parameter 5. Find the probability that next week there will be either
two or three accidents. How likely is it that there will be no accidents?


Problem 1.20. In a room are 12 men and 8 women. One randomly chooses 5 of the
20 persons. Given $k \in \{0, \ldots, 5\}$, what is the probability that among the five chosen
are exactly k women? How likely is it that among the five persons are more women
than men?


Problem 1.21. Two players A and B take turns rolling a die. The first to roll a “6” wins.
Player A starts. Find the probability that A wins. Suppose now there is a third player C
and the order of rolling the die is given by ABCABCA.... Find each player's probability
of winning.


Problem 1.22. Two players, say A and B, toss a biased coin where “head” appears with
probability $0 < p < 1$. The winner is whoever gets the first “head.” A starts, then B tosses twice,
then again A once, B twice, and so on. Determine the number p for which the game is
fair, that is, the probability that A (or B) wins is 1/2.


Problem 1.23. In an urn are 50 white and 200 red balls.
(1) Take out 10 balls with replacement. What is the probability to observe four
white balls? Give the exact value via the binomial distribution as well as the
approximated one using the related Poisson distribution.

(2) Next choose 10 balls without replacement. What is the probability to get four 
white balls in this case? 

(3) The number of balls in the urn is as above. But now we choose the balls with 
replacement until for the first time a white ball shows up. Find the probability of 
the following events: 

(a) The first white ball shows up in the fourth trial. 

(b) The first white ball appears strictly after the third trial. 

(c) The first white ball is observed in an even number of trials, that is, in the 
second or in the fourth or in the sixth, and so on trial. 


Problem 1.24. Place successively and independently four particles into five boxes. 
Thereby each box is equally likely. Find the probabilities of the following events: 

A := {Each box contains at most one particle} and B := {All 4 particles are in the 
same box}. 


Problem 1.25. Investigate the following generalization of Example 1.4.44: in urn $U_0$
are M balls and in urn $U_1$ are N balls for some $N, M \ge 1$. Choose $U_0$ with probability $1 - p$
and $U_1$ with probability p, and take out a ball from the chosen urn. Given $1 \le m \le M$,
find the probability that there are m balls left in $U_0$ when choosing the last ball out
of $U_1$. How do these probabilities change when $1 \le m \le N$, and we assume that there
are m balls in $U_1$ when choosing the last ball from $U_0$?


Problem 1.26. Use properties of the $\Gamma$-function to compute for $n \in \mathbb{N}$
i. el dy and [ x2 PR ay, 
0 0 


Problem 1.27. Prove formula (1.59) that relates the beta and the $\Gamma$-function.
Hint: Start with

$$\Gamma(x)\, \Gamma(y) = \int_0^{\infty} \int_0^{\infty} u^{x-1} v^{y-1} e^{-u-v}\, du\, dv$$

and change the variables as follows: $u = f(z, t) = z t$ and $v = g(z, t) = z(1 - t)$, where
$0 < z < \infty$ and $0 < t < 1$.


Problem 1.28. Prove that for $0 \le k \le n$

$$\binom{n}{k} = \frac{1}{(n+1)\, B(n-k+1, k+1)},$$

where $B(\cdot, \cdot)$ denotes Euler’s beta function (cf. formula (1.58)).


Problem 1.29. Write $x \in [0, 1)$ as decimal fraction $x = 0.x_1 x_2 \cdots$ with $x_j \in \{0, \ldots, 9\}$.
Let

$$A_j = \{x \in [0, 1) : x_j = 1\}.$$

If $\mathbb{P}$ denotes the uniform distribution on [0, 1], compute $\mathbb{P}(A_j)$ as well as $\mathbb{P}\big( \bigcap_{j=1}^{\infty} A_j \big)$.

Compute the same probabilities if $A_j = \{x \in [0, 1) : x_j = m\}$ for some fixed
$m \in \{0, \ldots, 9\}$.


Problem 1.30. Compute the distribution function of the Cauchy distribution (cf. Defin- 
ition 1.6.33). 


Problem 1.31. Let $F : \mathbb{R} \to [0, 1]$ be the distribution function of a probability measure.
Show that F possesses at most countably many points of discontinuity. Conclude from
this and Proposition 1.7.12 the following: If $\mathbb{P}$ is a probability measure on $\mathcal{B}(\mathbb{R})$, then
there are at most countably many $t \in \mathbb{R}$ such that $\mathbb{P}(\{t\}) > 0$.


Problem 1.32. Let $\Phi$ be the distribution function of the standard normal distribution
introduced in eq. (1.62). Show the following properties of $\Phi$.

1. For $t \in \mathbb{R}$ we have $\Phi(-t) = 1 - \Phi(t)$.

2. If $a > 0$, then

$$\mathcal{N}(0, 1)([-a, a]) = 2\, \Phi(a) - 1.$$

3. Prove formulas (1.63), that is,

$$\Phi(t) = \frac{1}{2} \Big[ 1 + \mathrm{erf}\Big( \frac{t}{\sqrt{2}} \Big) \Big] \quad \text{and} \quad \mathrm{erf}(t) = 2\, \Phi(\sqrt{2}\, t) - 1, \quad t \in \mathbb{R}.$$

4. Compute

$$\lim_{t \to \infty} \frac{1 - \Phi(t)}{t^{-1} e^{-t^2/2}}.$$

Hint: Use l’Hôpital’s rule.


Problem 1.33 (Bertrand paradox). Consider an equilateral triangle inscribed in a
circle of radius $r > 0$. Suppose a chord of the circle is chosen at random. What is the
probability that the chord is longer than a side of the triangle?
In this form the problem allows different answers. Why? Because we did not define
in which way the random chord is chosen.

1. The “random endpoints” method: Choose independently two uniformly distributed
random points on the circumference of the circle and draw the chord joining
them.

2. The “random radius” method: Choose a radius of the circle, that is, choose a random
angle in $[0, 2\pi]$, choose independently a point on the radius according to
the uniform distribution on $[0, r]$, and construct the chord through this point and
perpendicular to the radius.

3. The “random midpoint” method: Choose a point within the circle according to the
uniform distribution on the circle and construct a chord with the chosen point as
its midpoint.

Answer the above question about the length of the chord in each of the three cases.


Problem 1.34. A stick of length $L > 0$ is randomly broken into three pieces. Hereby we
assume that both points of break are uniformly distributed on $[0, L]$ and independent
of each other. What is the probability that these three parts piece together to a triangle?


2 Conditional Probabilities and Independence 


2.1 Conditional Probabilities 


In order to motivate the definition of conditional probabilities, let us start with the 
following easy example. 


Example 2.1.1. Roll a fair die twice. The probability of the event “sum of both rolls
equals 5” is 1/9. Suppose now we were told that the first roll was an even number.
Does this additional information make the event “sum equals 5” more likely? Or does
it even diminish the probability of its occurrence? To answer this question, we apply
the so-called technique of “restricting the sample space.” Since we know that the
event B = {First roll is even} has occurred, we may rule out elements in $B^c$ and restrict
our sample space. Choose B as new sample space. Its cardinality is 18. Moreover,
under this condition, an event A occurs if and only if $A \cap B$ does so. Hence, the “new”
probability of A under condition B, written $\mathbb{P}(A|B)$, is given by

$$\mathbb{P}(A|B) = \frac{\#(A \cap B)}{\#(B)} = \frac{\#(A \cap B)}{18}. \qquad (2.1)$$

In the question above, we asked for $\mathbb{P}(A|B)$, where

A = {Sum of both rolls equals 5} = {(1, 4), (2, 3), (3, 2), (4, 1)}.

Since $A \cap B = \{(2, 3), (4, 1)\}$, we obtain $\mathbb{P}(A|B) = 2/18 = 1/9$. Consequently, in this case,
condition B does not change the probability of the occurrence of A.

Define now A as the set of pairs adding to 6. Then $\mathbb{P}(A) = 5/36$, while the conditional
probability remains 1/9. Note that now $A \cap B = \{(2, 4), (4, 2)\}$. Thus, in this case,
condition B makes the occurrence of A less likely.
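
The counting in this example is easily reproduced by enumerating all 36 outcomes. The following Python sketch is meant as an illustration only; the function and variable names are chosen here and do not appear in the text.

```python
from fractions import Fraction
from itertools import product

# Sample space for two rolls of a die; all 36 pairs are equally likely.
omega = list(product(range(1, 7), repeat=2))


def conditional_probability(event_a, event_b):
    """P(A|B) = #(A and B) / #(B), obtained by restricting the sample space to B."""
    in_b = [w for w in omega if event_b(w)]
    in_both = [w for w in in_b if event_a(w)]
    return Fraction(len(in_both), len(in_b))


def first_even(w):
    return w[0] % 2 == 0


def sum_is_5(w):
    return w[0] + w[1] == 5


def sum_is_6(w):
    return w[0] + w[1] == 6


p5_given_b = conditional_probability(sum_is_5, first_even)
p6_given_b = conditional_probability(sum_is_6, first_even)
p6 = Fraction(sum(1 for w in omega if sum_is_6(w)), len(omega))
print(p5_given_b, p6_given_b, p6)
```

Both conditional probabilities equal 1/9, while the unconditional probability of “sum equals 6” is 5/36, in accordance with the discussion above.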


Before we state the definition of conditional probabilities in the general case, let us
rewrite eq. (2.1) as follows:

$$\mathbb{P}(A|B) = \frac{\#(A \cap B)}{\#(B)} = \frac{\#(A \cap B)/36}{\#(B)/36} = \frac{\mathbb{P}(A \cap B)}{\mathbb{P}(B)}. \qquad (2.2)$$

Equation (2.2) gives us a hint to introduce conditional probabilities in the general
setting.


Definition 2.1.2. Let $(\Omega, \mathcal{A}, \mathbb{P})$ be a probability space. Given events $A, B \in \mathcal{A}$ with
$\mathbb{P}(B) > 0$, the probability of A under condition B is defined by

$$\mathbb{P}(A|B) := \frac{\mathbb{P}(A \cap B)}{\mathbb{P}(B)}. \qquad (2.3)$$


Remark 2.1.3. If we know the values of $\mathbb{P}(A \cap B)$ and $\mathbb{P}(B)$, then formula (2.3) allows us
to evaluate $\mathbb{P}(A|B)$. Sometimes, it happens that we know the values of $\mathbb{P}(B)$ and $\mathbb{P}(A|B)$
and want to calculate $\mathbb{P}(A \cap B)$. In order to do this, we rewrite eq. (2.3) as

$$\mathbb{P}(A \cap B) = \mathbb{P}(B)\, \mathbb{P}(A|B). \qquad (2.4)$$

In this way, we get the desired value of $\mathbb{P}(A \cap B)$. Formula (2.4) is called the law of
multiplication.


The next two examples show how this law applies. 


Example 2.1.4. In an urn are two white and two black balls. Choose two balls without
replacing the first one. We want to evaluate the probability of occurrence of a black
ball in the first draw and of a white in the second one. Let us first find a suitable
mathematical model that describes this experiment. The sample space is given by
$\Omega = \{(b, b), (b, w), (w, b), (w, w)\}$, and we regard the events

A := {Second ball is white} = {(b, w), (w, w)} as well as
B := {First ball is black} = {(b, b), (b, w)}.

The event of interest is then $A \cap B = \{(b, w)\}$.

Which probabilities can be directly determined? Of course, the probability of occurrence
of B equals 1/2 because the number of white and black balls is the same.
Furthermore, if B occurred, then in the urn remained two white balls and one black
ball. Under this condition, event A occurs with probability 2/3, that is, $\mathbb{P}(A|B) = 2/3$.
Using eq. (2.4), we obtain

$$\mathbb{P}(\{(b, w)\}) = \mathbb{P}(A \cap B) = \mathbb{P}(B) \cdot \mathbb{P}(A|B) = \frac{1}{2} \cdot \frac{2}{3} = \frac{1}{3}.$$
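
The result can be cross-checked by a finer model in which the four balls are distinguishable, so that all ordered draws are equally likely. The following Python sketch uses this alternative model; the ball labels are hypothetical and only serve the enumeration.

```python
from fractions import Fraction
from itertools import permutations

# Hypothetical labels: two black balls b1, b2 and two white balls w1, w2.
balls = ["b1", "b2", "w1", "w2"]
draws = list(permutations(balls, 2))  # 12 equally likely ordered draws

first_black = [d for d in draws if d[0].startswith("b")]
black_then_white = [d for d in first_black if d[1].startswith("w")]

p_b = Fraction(len(first_black), len(draws))                      # P(B)
p_a_given_b = Fraction(len(black_then_white), len(first_black))   # P(A|B)
p_joint = p_b * p_a_given_b                                       # law of multiplication (2.4)
print(p_b, p_a_given_b, p_joint)
```

Counting directly, 4 of the 12 ordered draws show a black ball followed by a white one, which again gives 1/3.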


Example 2.1.5. Among three non-distinguishable coins two are fair and one is biased.
Tossing the biased coin, “head” appears with probability 1/3, hence “tail” appears
with probability 2/3. We choose at random one of the three coins and toss it. Find
the probability to observe “tail” at the biased coin.

To solve this problem, let us first mention that the sample space $\Omega = \{H, T\}$ is not
adequate to describe that experiment. Why? Because the event {H} may have different
probabilities depending on whether it occurs at a biased or at a fair coin. We have to
distinguish between the appearance of “head” or “tail” at the different types of coins.
Hence, an adequate choice of the sample space is

$$\Omega := \{(H, B), (T, B), (H, F), (T, F)\}.$$

Here, B stands for biased and F indicates that the coin was fair. The event of interest is
{(T, B)}. Set

$$T := \{(T, B), (T, F)\} \quad \text{as well as} \quad B := \{(H, B), (T, B)\}.$$

Then T occurs if “tail” appears regardless of the type of the coin while B occurs if we
have chosen the biased coin. Of course, it follows that $\{(T, B)\} = T \cap B$. Since only one
of the three coins is biased, we have $\mathbb{P}(B) = 1/3$. By assumption $\mathbb{P}(T|B) = 2/3$, hence an
application of eq. (2.4) leads to

$$\mathbb{P}(\{(T, B)\}) = \mathbb{P}(B)\, \mathbb{P}(T|B) = \frac{1}{3} \cdot \frac{2}{3} = \frac{2}{9}.$$

Next, we present two examples where formula (2.3) applies directly.

Example 2.1.6. Roll a die twice. One already knows that the first number is not “6.”
What is the probability that the sum of both rolls is greater than or equal to 10?

Answer: The model for this experiment is $\Omega = \{1, \ldots, 6\}^2$ endowed with the uniform
distribution $\mathbb{P}$ on $\mathcal{P}(\Omega)$. The event B := {First result is not “6”} contains the 30
elements

$$\{(1, 1), \ldots, (5, 1), \ldots, (1, 6), \ldots, (5, 6)\},$$

and if A consists of pairs with sum equal to or larger than 10, then

$$A = \{(4, 6), (5, 6), (6, 6), (5, 5), (6, 5), (6, 4)\}, \quad \text{hence} \quad A \cap B = \{(4, 6), (5, 6), (5, 5)\}.$$

Therefore, it follows

$$\mathbb{P}(A|B) = \frac{\mathbb{P}(A \cap B)}{\mathbb{P}(B)} = \frac{3/36}{30/36} = \frac{1}{10}.$$


In the case that all elementary events are equally likely, there exists a more direct way 
to evaluate P(A|B). We reduce the sample space as we already did in Example 2.1.1. 


Proposition 2.1.7 (Reduction of the sample space). Suppose the sample space $\Omega$ is finite
and let $\mathbb{P}$ be the uniform distribution on $\mathcal{P}(\Omega)$. Then for all events A and non-empty
B in $\Omega$, we have

$$\mathbb{P}(A|B) = \frac{\#(A \cap B)}{\#(B)}. \qquad (2.5)$$

Proof: This easily follows from

$$\mathbb{P}(A|B) = \frac{\mathbb{P}(A \cap B)}{\mathbb{P}(B)} = \frac{\#(A \cap B)/\#(\Omega)}{\#(B)/\#(\Omega)} = \frac{\#(A \cap B)}{\#(B)}.$$ □


Example 2.1.8. We want to investigate Example 2.1.6 once more, this time using
formula (2.5) directly. Since $\#(A \cap B) = 3$ and $\#(B) = 30$, we get as before

$$\mathbb{P}(A|B) = \frac{\#(A \cap B)}{\#(B)} = \frac{3}{30} = \frac{1}{10}.$$


Remark 2.1.9. It is important to state that Proposition 2.1.7 becomes false for general 
probabilities P on P(Q). Formula (2.5) is only valid in the case that P is the uniform 
distribution on P(Q). 


Example 2.1.10. The duration of a telephone call is exponentially distributed with
parameter $\lambda > 0$. Find the probability that a call does not last more than 5 minutes
provided it already lasted 2 minutes.

Solution: Let A be the event that the call does not last more than 5 minutes, that is,
A = [0, 5]. We know it already lasted 2 minutes, hence event $B = [2, \infty)$ has occurred.
Thus, under condition B, it follows

$$E_\lambda(A|B) = \frac{E_\lambda(A \cap B)}{E_\lambda(B)} = \frac{E_\lambda([2, 5])}{E_\lambda([2, \infty))} = \frac{e^{-2\lambda} - e^{-5\lambda}}{e^{-2\lambda}} = 1 - e^{-3\lambda}.$$

Note the interesting fact that this conditional probability equals $E_\lambda([0, 3])$. What does
this tell us? It says that the probability that a call lasts no more than another 3 minutes
is independent of the fact that it already lasted 2 minutes. This means that the duration
of a call did not “become older.” Independent of the fact that it already lasted 2
minutes, the probability for talking no more than another 3 minutes remains the same.
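
The identity between the conditional probability and $E_\lambda([0, 3])$ is easily verified numerically. The Python sketch below is an illustration under assumptions made here: the function names and the parameter value $\lambda = 0.4$ are arbitrary choices, not taken from the text.

```python
import math


def exp_cdf(t, lam):
    """E_lam([0, t]) = 1 - exp(-lam * t) for t > 0, and 0 otherwise."""
    return 1.0 - math.exp(-lam * t) if t > 0 else 0.0


def conditional_no_longer_than(total, elapsed, lam):
    """E_lam([elapsed, total]) / E_lam([elapsed, infinity))."""
    numerator = exp_cdf(total, lam) - exp_cdf(elapsed, lam)
    denominator = 1.0 - exp_cdf(elapsed, lam)
    return numerator / denominator


lam = 0.4  # hypothetical parameter; any lam > 0 works
conditional = conditional_no_longer_than(5, 2, lam)
unconditional = exp_cdf(3, lam)
print(conditional, unconditional)
```

Both values coincide up to rounding, whatever positive value is chosen for the parameter.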


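Let us come back to the general case. Fix an event $B \in \mathcal{A}$ with $\mathbb{P}(B) > 0$. Then

$$A \mapsto \mathbb{P}(A|B), \quad A \in \mathcal{A},$$

is a well-defined mapping from $\mathcal{A}$ to [0, 1]. Its main properties are summarized in the
next proposition.


Proposition 2.1.11. Let $(\Omega, \mathcal{A}, \mathbb{P})$ be an arbitrary probability space. Then for each $B \in \mathcal{A}$
with $\mathbb{P}(B) > 0$, the mapping $A \mapsto \mathbb{P}(A|B)$ is a probability measure on $\mathcal{A}$. It is concentrated
on B, that is,

$$\mathbb{P}(B|B) = 1 \quad \text{or, equivalently,} \quad \mathbb{P}(B^c|B) = 0.$$

Proof: Of course, one has

$$\mathbb{P}(\emptyset|B) = \mathbb{P}(\emptyset \cap B)/\mathbb{P}(B) = 0 \quad \text{and} \quad \mathbb{P}(\Omega|B) = \mathbb{P}(\Omega \cap B)/\mathbb{P}(B) = \mathbb{P}(B)/\mathbb{P}(B) = 1.$$

Thus, it remains to prove that $\mathbb{P}(\cdot|B)$ is $\sigma$-additive. To this end, choose disjoint
$A_1, A_2, \ldots$ in $\mathcal{A}$. Then also $A_1 \cap B, A_2 \cap B, \ldots$ are disjoint and using the $\sigma$-additivity
of $\mathbb{P}$ leads to

$$\mathbb{P}\Big( \bigcup_{j=1}^{\infty} A_j \,\Big|\, B \Big) = \frac{\mathbb{P}\big( \big[ \bigcup_{j=1}^{\infty} A_j \big] \cap B \big)}{\mathbb{P}(B)} = \frac{\mathbb{P}\big( \bigcup_{j=1}^{\infty} (A_j \cap B) \big)}{\mathbb{P}(B)} = \sum_{j=1}^{\infty} \frac{\mathbb{P}(A_j \cap B)}{\mathbb{P}(B)} = \sum_{j=1}^{\infty} \mathbb{P}(A_j|B).$$

Consequently, as asserted, $\mathbb{P}(\cdot|B)$ is a probability. Since the identity $\mathbb{P}(B|B) = 1$ is
obvious, this ends the proof. □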


Definition 2.1.12. The mapping $\mathbb{P}(\cdot|B)$ is called conditional probability or also
conditional distribution (under condition B).


Remark 2.1.13. The main advantage of Proposition 2.1.11 is that it implies that conditional
probabilities share all the properties of “ordinary” probability measures. For
example, we have

$$\mathbb{P}(A_2 \setminus A_1|B) = \mathbb{P}(A_2|B) - \mathbb{P}(A_1|B) \quad \text{provided that } A_1 \subseteq A_2$$

or

$$\mathbb{P}(A_1 \cup A_2|B) = \mathbb{P}(A_1|B) + \mathbb{P}(A_2|B) - \mathbb{P}(A_1 \cap A_2|B).$$


We come now to the so-called law of total probability. It allows us to evaluate the
probability of an event A knowing only its conditional probabilities $\mathbb{P}(A|B_j)$ for certain
$B_j \in \mathcal{A}$. More precisely, the following is valid.


Proposition 2.1.14 (Law of total probability). Let $(\Omega, \mathcal{A}, \mathbb{P})$ be a probability space and
let $B_1, \ldots, B_n$ in $\mathcal{A}$ be disjoint with $\mathbb{P}(B_j) > 0$ and $\bigcup_{j=1}^{n} B_j = \Omega$. Then for each $A \in \mathcal{A}$ we have

$$\mathbb{P}(A) = \sum_{j=1}^{n} \mathbb{P}(B_j)\, \mathbb{P}(A|B_j). \qquad (2.6)$$

Proof: Let us start with the investigation of the right-hand side of eq. (2.6). By the
definition of the conditional probability, this expression may be rewritten as

$$\sum_{j=1}^{n} \mathbb{P}(B_j)\, \mathbb{P}(A|B_j) = \sum_{j=1}^{n} \mathbb{P}(B_j)\, \frac{\mathbb{P}(A \cap B_j)}{\mathbb{P}(B_j)} = \sum_{j=1}^{n} \mathbb{P}(A \cap B_j). \qquad (2.7)$$

The sets $B_1, \ldots, B_n$ are disjoint, hence so are $A \cap B_1, \ldots, A \cap B_n$. Thus, the finite
additivity of $\mathbb{P}$ implies

$$\sum_{j=1}^{n} \mathbb{P}(A \cap B_j) = \mathbb{P}\Big( \bigcup_{j=1}^{n} (A \cap B_j) \Big) = \mathbb{P}\Big( \Big( \bigcup_{j=1}^{n} B_j \Big) \cap A \Big) = \mathbb{P}(\Omega \cap A) = \mathbb{P}(A).$$

Together with eq. (2.7), this proves eq. (2.6). □


Example 2.1.15. A fair coin is tossed four times. Suppose we observe exactly k times 
“heads” for some k = 0,...,4. According to the observed k, we take k dice and roll 
them. Find the probability that number “6” does not appear. Note that k = 0 means 
that we do not roll a die, hence in this case “6” cannot appear. 

Solution: As sample space, we choose $\Omega = \{(k, Y), (k, N) : k = 0, \ldots, 4\}$, where
$(k, Y)$ means that we rolled $k$ dice and got a “6” on at least one of them. In the
same way, $(k, N)$ stands for $k$ dice and no “6.” Let $N = \{(0, N), \ldots, (4, N)\}$ and
$B_k = \{(k, Y), (k, N)\}$, $k = 0, \ldots, 4$. Then $B_k$ occurs if we observed $k$ “heads.” The
conditional probabilities equal

$$P(N|B_0) = 1, \quad P(N|B_1) = 5/6, \quad \ldots, \quad P(N|B_4) = (5/6)^4,$$

while

$$P(B_k) = \binom{4}{k} \frac{1}{2^4}, \quad k = 0, \ldots, 4.$$

The events $B_0, \ldots, B_4$ satisfy the assumptions of Proposition 2.1.14, thus eq. (2.6)
applies and leads to

$$P(N) = \frac{1}{2^4} \sum_{k=0}^{4} \binom{4}{k} (5/6)^k = \frac{1}{2^4} \Big(\frac{5}{6} + 1\Big)^4 = \frac{1}{2^4} \Big(\frac{11}{6}\Big)^4 = 0.706066743.$$
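The value obtained in Example 2.1.15 can be checked by evaluating the sum in eq. (2.6) directly. The following is only a small sketch in Python; the function name `prob_no_six` and the use of exact fractions are choices made here for illustration and are not part of the text.

```python
from fractions import Fraction
from math import comb


def prob_no_six(tosses: int = 4) -> Fraction:
    """P(no "6" appears), computed with the law of total probability, eq. (2.6)."""
    total = Fraction(0)
    for k in range(tosses + 1):
        p_k_heads = Fraction(comb(tosses, k), 2**tosses)  # P(B_k)
        p_no_six_given_k = Fraction(5, 6) ** k            # P(N | B_k)
        total += p_k_heads * p_no_six_given_k
    return total


result = prob_no_six()
print(result, float(result))
```

Working with `Fraction` avoids rounding, so the result can be compared exactly with $(11/6)^4 / 2^4$.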


Example 2.1.16. Three different machines, $M_1$, $M_2$ and $M_3$, produce light bulbs. In a
single day, $M_1$ produces 500 bulbs, $M_2$ 200 and $M_3$ 100. The quality of the produced
bulbs depends on the machine: 5% of the light bulbs produced by $M_1$ are defective,
10% of those produced by $M_2$ and only 2% of those produced by $M_3$. At the end of a day,
a controller chooses at random one of the 800 produced light bulbs and tests it.
Determine the probability that the checked bulb is defective.

Solution: The probabilities that the checked bulb was produced by $M_1$, $M_2$ or $M_3$
are 5/8, 1/4 and 1/8, respectively. The conditional probabilities for choosing a defective
bulb produced by $M_1$, $M_2$ or $M_3$ were given as 1/20, 1/10 and 1/50, respectively. If $D$ is
the event that the tested bulb was defective, then the law of total probability yields

$$P(D) = \frac{5}{8} \cdot \frac{1}{20} + \frac{1}{4} \cdot \frac{1}{10} + \frac{1}{8} \cdot \frac{1}{50} = \frac{47}{800} = 0.05875.$$

Let us look at Example 2.1.16 from a different point of view. When choosing a light
bulb out of the 800 produced, there were certain fixed probabilities that it was
produced by $M_1$, $M_2$ or $M_3$, namely 5/8, 1/4 and 1/8. These are the
probabilities before checking a bulb. Therefore, they are called a priori probabilities.
After checking a bulb, we obtained the additional information that it was defective.
Does this additional information change the probabilities of which of $M_1$, $M_2$ or $M_3$
produced it? More precisely, if as above $D$ occurs if the tested bulb is defective, then
we now ask for the conditional probabilities $P(M_1|D)$, $P(M_2|D)$ and $P(M_3|D)$. To
understand that these probabilities may differ considerably from the a priori probabilities,
imagine that, for example, $M_1$ produces almost no defective bulbs. Then it will be very
unlikely that the tested bulb has been produced by $M_1$, although $P(M_1)$ may be big.

Because $P(M_1|D)$, $P(M_2|D)$ and $P(M_3|D)$ are the probabilities after executing the
random experiment (choosing and testing the bulb), they are called a posteriori
probabilities.

Let us now introduce the exact and general definition of a priori and a posteriori 
probabilities. 


Definition 2.1.17. Suppose there is a probability space $(\Omega, \mathcal{A}, P)$ and there are
disjoint events $B_1, \ldots, B_n \in \mathcal{A}$ satisfying $\Omega = \bigcup_{j=1}^{n} B_j$. Then we call $P(B_1), \ldots, P(B_n)$
the a priori probabilities of $B_1, \ldots, B_n$. Let $A \in \mathcal{A}$ with $P(A) > 0$ be given. Then
the conditional probabilities $P(B_1|A), \ldots, P(B_n|A)$ are said to be the a posteriori
probabilities, that is, those after the occurrence of $A$.


To calculate the a posteriori probabilities, the next proposition turns out to be very 
useful. 


Proposition 2.1.18 (Bayes’ formula). Suppose we are given disjoint events $B_1$ to $B_n$
satisfying $\bigcup_{j=1}^{n} B_j = \Omega$ and $P(B_j) > 0$. Let $A$ be an event with $P(A) > 0$. Then for each
$j \le n$ the following equation holds:

$$P(B_j|A) = \frac{P(B_j)\, P(A|B_j)}{\sum_{i=1}^{n} P(B_i)\, P(A|B_i)}. \tag{2.8}$$


Proof: Proposition 2.1.14 implies

$$\sum_{i=1}^{n} P(B_i)\, P(A|B_i) = P(A).$$

Hence, the right-hand side of eq. (2.8) may also be written as

$$\frac{P(B_j)\, P(A|B_j)}{P(A)} = \frac{P(B_j)\, \frac{P(A \cap B_j)}{P(B_j)}}{P(A)} = \frac{P(A \cap B_j)}{P(A)} = P(B_j|A),$$

and the proposition is proven. □




Remark 2.1.19. In the case that $P(A)$ is already known, Bayes’ formula simplifies to

$$P(B_j|A) = \frac{P(B_j)\, P(A|B_j)}{P(A)}, \quad j = 1, \ldots, n. \tag{2.9}$$


Remark 2.1.20. Let us treat the special case of two sets partitioning $\Omega$. If $B_1 = B$, then
necessarily $B_2 = B^c$, hence $\Omega = B \cup B^c$. Then formula (2.8) looks as follows:

$$P(B|A) = \frac{P(B)\, P(A|B)}{P(B)\, P(A|B) + P(B^c)\, P(A|B^c)} \tag{2.10}$$

and

$$P(B^c|A) = \frac{P(B^c)\, P(A|B^c)}{P(B)\, P(A|B) + P(B^c)\, P(A|B^c)}. \tag{2.11}$$


Again, if the probability of A is known, the denominators in eqs. (2.10) and (2.11) may 
be replaced by P(A). 


Example 2.1.21. Let us use Bayes’ formula to calculate the a posteriori probabilities
in Example 2.1.16. Recall that $D$ occurred if the tested bulb was defective. We already
know $P(D) = 47/800$, hence we may apply eq. (2.9). Doing so, we get

$$P(M_1|D) = \frac{P(M_1)\, P(D|M_1)}{P(D)} = \frac{5/8 \cdot 1/20}{47/800} = \frac{25}{47},$$

$$P(M_2|D) = \frac{P(M_2)\, P(D|M_2)}{P(D)} = \frac{1/4 \cdot 1/10}{47/800} = \frac{20}{47},$$

$$P(M_3|D) = \frac{P(M_3)\, P(D|M_3)}{P(D)} = \frac{1/8 \cdot 1/50}{47/800} = \frac{2}{47}.$$

By the statement of the problem, the a priori probabilities were given by $P(M_1) = 5/8$,
$P(M_2) = 1/4$ and $P(M_3) = 1/8$. In the case that the tested light bulb was defective,
these probabilities change to 25/47, 20/47 and 2/47. This tells us that it becomes less
likely that the tested bulb was produced by $M_1$ or $M_3$; their probabilities diminish
by 0.0930851 and 0.0824468, respectively. On the other hand, the probability of $M_2$
increases by 0.175532.

Finally, note that Proposition 2.1.11 implies that the sum of the a posteriori
probabilities has to be 1. Because of 25/47 + 20/47 + 2/47 = 1, this is true in that example.
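The light-bulb computations of Examples 2.1.16 and 2.1.21 follow one pattern: first the law of total probability gives $P(D)$, then Bayes' formula gives the a posteriori probabilities. A minimal Python sketch of this pattern is given below; the function name `posteriors` and the variable names are illustrative assumptions, not notation from the text.

```python
from fractions import Fraction


def posteriors(priors, likelihoods):
    """Return (P(A), [P(B_j | A)]) from the priors P(B_j) and likelihoods P(A | B_j)."""
    # Law of total probability, eq. (2.6)
    evidence = sum(p * l for p, l in zip(priors, likelihoods))
    # Bayes' formula in the form of eq. (2.9)
    return evidence, [p * l / evidence for p, l in zip(priors, likelihoods)]


priors = [Fraction(5, 8), Fraction(1, 4), Fraction(1, 8)]            # P(M_1), P(M_2), P(M_3)
defect_rates = [Fraction(1, 20), Fraction(1, 10), Fraction(1, 50)]   # P(D | M_j)

p_defective, post = posteriors(priors, defect_rates)
print(p_defective)                  # P(D)
print([str(q) for q in post])       # P(M_1|D), P(M_2|D), P(M_3|D)
print(sum(post))                    # the a posteriori probabilities sum to 1
```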


Example 2.1.22. In order to figure out whether or not a person suffers from a certain 
disease, say disease X, a test is assumed to give a clue. If the tested person is sick, 
then the test is positive in 96% of cases. If the person is well, then with 94% accuracy 
the test will be negative. Furthermore, it is known that 0.4% of the population suffers 
from X. 

Now a person, chosen at random, is tested. Suppose the result was positive. Find
the probability that this person really suffers from X.

Solution: As sample space, we may choose $\Omega = \{(X, p), (X, n), (X^c, p), (X^c, n)\}$,
where, for example, $(X, n)$ means the person suffers from X and the test was negative.
Set $A := \{(X, p), (X^c, p)\}$. Then $A$ occurs if and only if the test turned out to be positive.
Furthermore, event $B := \{(X, p), (X, n)\}$ occurs in the case that the tested person suffers
from X. Known are

$$P(A|B) = 0.96, \quad P(A|B^c) = 0.06 \quad \text{and} \quad P(B) = 0.004, \quad \text{hence } P(B^c) = 0.996.$$

Therefore, by eq. (2.10), the probability we asked for can be calculated as follows:

$$P(B|A) = \frac{P(B)\, P(A|B)}{P(B)\, P(A|B) + P(B^c)\, P(A|B^c)} = \frac{0.004 \cdot 0.96}{0.004 \cdot 0.96 + 0.996 \cdot 0.06} = \frac{0.00384}{0.0636} = 0.0603774.$$

That tells us that it is quite unlikely that a randomly chosen person with a positive test
is really sick. The chance for this being true is only about 6%.
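The two-set form (2.10) of Bayes' formula is easy to evaluate numerically. The sketch below uses a helper name, `posterior_two_sets`, that is an assumption made for this illustration; it reproduces the figure of about 6% from Example 2.1.22.

```python
def posterior_two_sets(p_b: float, p_a_given_b: float, p_a_given_bc: float) -> float:
    """P(B | A) for the partition {B, B^c}, following eq. (2.10)."""
    numerator = p_b * p_a_given_b
    # Denominator: P(A) by the law of total probability
    evidence = numerator + (1 - p_b) * p_a_given_bc
    return numerator / evidence


# Prevalence 0.4%, positive rate among the sick 96%, among the healthy 6%
p_sick_given_positive = posterior_two_sets(0.004, 0.96, 0.06)
print(round(p_sick_given_positive, 7))
```

Changing the first argument shows how strongly the result depends on the prevalence $P(B)$: the rarer the disease, the smaller the a posteriori probability for the same test.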


2.2 Independence of Events 


What does it mean that two events are independent or, more precisely, that they occur 
independently of each other? To get an idea, let us look at the following example. 


Example 2.2.1. Roll a fair die twice. Event $B$ occurs if the first number is even, while
event $A$ consists of all pairs $(x_1, x_2)$ where $x_2 = 5$ or $x_2 = 6$. It is intuitively clear
that these two events occur independently of each other. But how to express this
mathematically? To answer this question, think about the probability of $A$ under the
condition $B$. Whether or not $B$ occurred has no influence on the occurrence
of $A$. For the occurrence or nonoccurrence of $A$, it is completely insignificant what
happened in the first roll. Mathematically this means that $P(A|B) = P(A)$. Let us check
whether this is true in this concrete case. Indeed, we have $P(A) = 1/3$ as well as

$$P(A|B) = \frac{P(A \cap B)}{P(B)} = \frac{6/36}{1/2} = \frac{1}{3}.$$


The previous example suggests that independence of $A$ from $B$ could be described by

$$P(A) = P(A|B) = \frac{P(A \cap B)}{P(B)}. \tag{2.12}$$

But formula (2.12) has a disadvantage, namely we have to assume $P(B) > 0$ to ensure
that $P(A|B)$ exists. To overcome this problem, rewrite eq. (2.12) as

$$P(A \cap B) = P(A)\, P(B). \tag{2.13}$$

In this form, we may take eq. (2.13) as a basis for the definition of independence.

Definition 2.2.2. Let $(\Omega, \mathcal{A}, P)$ be a probability space. Two events $A$ and $B$ in $\mathcal{A}$ are
said to be (stochastically) independent provided that

$$P(A \cap B) = P(A) \cdot P(B). \tag{2.14}$$

In the case that eq. (2.14) does not hold, the events $A$ and $B$ are called
(stochastically) dependent.

Remark 2.2.3. In the sequel, we use the terms “independent” and “dependent”
without adding the word “stochastically.” Since we will not use other versions of
independence, there should be no confusion.


Example 2.2.4. A fair die is rolled twice. Event $A$ occurs if the first roll is either “1” or
“2” while $B$ occurs if the sum of both rolls equals 7. Are $A$ and $B$ independent?
Answer: We have $P(A) = 1/3$, $P(B) = 1/6$ as well as $P(A \cap B) = 2/36 = 1/18$. Hence, we
get $P(A \cap B) = P(A) \cdot P(B)$, and $A$ and $B$ are independent.
Question: Are $A$ and $B$ also independent if $A$ is as before and $B$ is defined as the set
of pairs with sum 4?
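For events on a finite sample space with the uniform distribution, eq. (2.14) can be checked by plain enumeration. The following Python sketch does this for the two rolls of a die; the helper names `prob` and `are_independent` are assumptions introduced here for illustration.

```python
from fractions import Fraction
from itertools import product

OMEGA = list(product(range(1, 7), repeat=2))  # the 36 equally likely pairs (x1, x2)


def prob(event) -> Fraction:
    """Uniform probability of the set of pairs for which the predicate `event` is true."""
    return Fraction(sum(1 for w in OMEGA if event(w)), len(OMEGA))


def are_independent(a, b) -> bool:
    """Check eq. (2.14): P(A ∩ B) == P(A) * P(B)."""
    return prob(lambda w: a(w) and b(w)) == prob(a) * prob(b)


first_is_1_or_2 = lambda w: w[0] in (1, 2)
sum_is_7 = lambda w: w[0] + w[1] == 7

print(prob(first_is_1_or_2), prob(sum_is_7))
print(are_independent(first_is_1_or_2, sum_is_7))
```

Replacing `sum_is_7` by a predicate for another sum is one way to explore the question posed at the end of Example 2.2.4.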


Example 2.2.5. In an urn, there are $n$, $n > 2$, white balls and also $n$ black balls. One
chooses two balls without replacing the first one. Let $A$ be the event that the second
ball is black while $B$ occurs if the first ball was white. Are $A$ and $B$ independent?

Answer: The probability of $B$ equals 1/2. To calculate $P(A)$, we use Proposition
2.1.14. Then we get

$$P(A) = P(B)\, P(A|B) + P(B^c)\, P(A|B^c) = \frac{1}{2} \cdot \frac{n}{2n-1} + \frac{1}{2} \cdot \frac{n-1}{2n-1} = \frac{1}{2},$$

hence, $P(A) \cdot P(B) = 1/4$.
On the other hand, we have

$$P(A \cap B) = P(B)\, P(A|B) = \frac{1}{2} \cdot \frac{n}{2n-1} = \frac{n}{4n-2} \ne \frac{1}{4}.$$

Consequently, $A$ and $B$ are dependent.


Remark 2.2.6. Note that, if $n \to \infty$, then

$$P(A \cap B) = \frac{n}{4n-2} \to \frac{1}{4} = P(A)\, P(B).$$

This tells us the following: if $n$ is big, then $A$ and $B$ are “almost” independent or,
equivalently, the degree of dependence between $A$ and $B$ is very small. This question
will be investigated more thoroughly in Chapter 5 when a measure for the degree of
dependence is available.
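To see how quickly the events of Example 2.2.5 become “almost” independent, one can tabulate the difference $P(A \cap B) - P(A)P(B)$ for growing $n$. The sketch below is only an illustration; the function name `urn_probabilities` is an assumption. Algebraically the difference equals $1/(8n-4)$, which the exact fractions confirm.

```python
from fractions import Fraction


def urn_probabilities(n: int):
    """Return (P(A ∩ B), P(A) * P(B)) for n white and n black balls, two draws without replacement."""
    p_b = Fraction(1, 2)                       # first ball white
    p_a_given_b = Fraction(n, 2 * n - 1)       # second black, given first white
    p_a_given_bc = Fraction(n - 1, 2 * n - 1)  # second black, given first black
    p_a = p_b * p_a_given_b + (1 - p_b) * p_a_given_bc
    return p_b * p_a_given_b, p_a * p_b


for n in (3, 10, 100, 1000):
    p_ab, p_a_times_p_b = urn_probabilities(n)
    print(n, p_ab - p_a_times_p_b)
```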


Next, we prove some properties of independent events. 


Proposition 2.2.7. Let $(\Omega, \mathcal{A}, P)$ be a probability space.
1. For any $A \in \mathcal{A}$, the events $A$ and $\emptyset$ as well as $A$ and $\Omega$ are independent.¹
2. If $A$ and $B$ are independent, then so are $A$ and $B^c$ as well as $A^c$ and $B^c$.


Proof: We have

$$P(A \cap \emptyset) = P(\emptyset) = 0 = P(A) \cdot 0 = P(A) \cdot P(\emptyset),$$

hence, $A$ and $\emptyset$ are independent.
In the same way, the independence of $A$ and $\Omega$ follows by

$$P(A \cap \Omega) = P(A) = P(A) \cdot 1 = P(A) \cdot P(\Omega).$$

To prove the second part, assume that $A$ and $B$ are independent. Our aim is to show
that $A$ and $B^c$ are independent as well. We know that

$$P(A \cap B) = P(A)\, P(B)$$

and we want to show that

$$P(A \cap B^c) = P(A)\, P(B^c).$$

Let us start with the right-hand side of the last equation. Using the independence of $A$
and $B$ and $A \cap B \subseteq A$, it follows that

$$P(A)\, P(B^c) = P(A)(1 - P(B)) = P(A) - P(A) \cdot P(B) = P(A) - P(A \cap B) = P(A \setminus (A \cap B)). \tag{2.15}$$

Since $A \setminus (A \cap B) = A \setminus B = A \cap B^c$, from eq. (2.15) we derive

$$P(A) \cdot P(B^c) = P(A \cap B^c).$$

Consequently, as asserted, $A$ and $B^c$ are independent.

If $A$ and $B$ are independent, then so are $B$ and $A$, and as seen above, so are $B$
and $A^c$. Another application of the first step, this time with $A^c$ and $B$, shows that also
$A^c$ and $B^c$ are independent. This completes the proof. □


Suppose we are given $n$ events $A_1, \ldots, A_n$ in $\mathcal{A}$. We want to figure out when they are
independent. A first possible approach could be as follows.


¹ For a more general result, compare Problem 2.10.

Definition 2.2.8. Events $A_1, \ldots, A_n$ are said to be pairwise independent if,
whenever $i \ne j$, then

$$P(A_i \cap A_j) = P(A_i) \cdot P(A_j).$$

In other words, for all $1 \le i < j \le n$ the events $A_i$ and $A_j$ are independent.

Unfortunately, for many purposes, the property of pairwise independence is too weak.
For example, as we will see next, in general it does not imply the important equation

$$P(A_1 \cap \cdots \cap A_n) = P(A_1) \cdots P(A_n). \tag{2.16}$$
Example 2.2.9. Roll a die twice and define events $A_1$, $A_2$ and $A_3$ as follows:

$$A_1 := \{2, 4, 6\} \times \{1, \ldots, 6\}$$
$$A_2 := \{1, \ldots, 6\} \times \{1, 3, 5\}$$
$$A_3 := \{2, 4, 6\} \times \{1, 3, 5\} \cup \{1, 3, 5\} \times \{2, 4, 6\}.$$

Verbally this says that $A_1$ occurs if the first roll is even, $A_2$ occurs if the second one is
odd and $A_3$ occurs if either the first number is odd and the second is even or vice versa.
Direct calculations give $P(A_1) = P(A_2) = P(A_3) = 1/2$ as well as

$$P(A_1 \cap A_2) = P(A_1 \cap A_3) = P(A_2 \cap A_3) = \frac{1}{4}.$$

Hence, $A_1$, $A_2$ and $A_3$ are pairwise independent.
Since

$$A_1 \cap A_2 \cap A_3 = A_1 \cap A_2,$$

it follows

$$P(A_1 \cap A_2 \cap A_3) = P(A_1 \cap A_2) = \frac{1}{4} \ne \frac{1}{8} = P(A_1) \cdot P(A_2) \cdot P(A_3).$$

So, we found three pairwise independent events for which eq. (2.16) is not valid.

After mentioning that pairwise independence of $A_1, \ldots, A_n$ does not imply

$$P(A_1 \cap \cdots \cap A_n) = P(A_1) \cdots P(A_n), \tag{2.17}$$

it makes sense to ask whether or not pairwise independence can be derived from
eq. (2.17). The next example shows that, in general, this is also not true.

Example 2.2.10. Let $\Omega = \{1, \ldots, 12\}$ be endowed with the uniform distribution $P$, that
is, for any $A \subseteq \Omega$ we have $P(A) = \#(A)/12$. Define events $A_1$, $A_2$ and $A_3$ as $A_1 := \{1, \ldots, 9\}$,
$A_2 := \{6, 7, 8, 9\}$ and $A_3 := \{9, 10, 11, 12\}$. Direct calculations give

$$P(A_1) = \frac{9}{12} = \frac{3}{4}, \quad P(A_2) = \frac{4}{12} = \frac{1}{3} \quad \text{and} \quad P(A_3) = \frac{4}{12} = \frac{1}{3}.$$

Moreover, we have

$$P(A_1 \cap A_2 \cap A_3) = P(\{9\}) = \frac{1}{12} = \frac{3}{4} \cdot \frac{1}{3} \cdot \frac{1}{3} = P(A_1) \cdot P(A_2) \cdot P(A_3),$$

hence eq. (2.17) is valid. But, because of

$$P(A_1 \cap A_2) = P(A_2) = \frac{1}{3} \ne \frac{1}{4} = P(A_1) \cdot P(A_2),$$

the events $A_1$, $A_2$ and $A_3$ are not pairwise independent.


Remark 2.2.11. Summing up, Examples 2.2.9 and 2.2.10 show that neither pairwise
independence nor eq. (2.17) is suitable to define the independence of more than two
events. Why? On the one hand, independence should yield eq. (2.17) and, on the other
hand, whenever $A_1, \ldots, A_n$ are independent, then so should be any subcollection of
them. In particular, independence should imply pairwise independence.


A reasonable definition of independence of n events is as follows. 


Definition 2.2.12. The events $A_1, \ldots, A_n$ are said to be independent provided
that for each subset $I \subseteq \{1, \ldots, n\}$ we have

$$P\Big(\bigcap_{i \in I} A_i\Big) = \prod_{i \in I} P(A_i). \tag{2.18}$$


Remark 2.2.13. Of course, it suffices that eq. (2.18) is valid for sets $I \subseteq \{1, \ldots, n\}$
satisfying $\#(I) \ge 2$. Indeed, if $\#(I) = 1$, then eq. (2.18) holds trivially.


Remark 2.2.14. Another way to introduce independence is as follows: For all $m \ge 2$
and all $1 \le i_1 < \cdots < i_m \le n$, we require

$$P(A_{i_1} \cap \cdots \cap A_{i_m}) = P(A_{i_1}) \cdots P(A_{i_m}).$$

Identify $I$ with $\{i_1, \ldots, i_m\}$ to see that both definitions are equivalent.

At first glance, Definition 2.2.12 looks complicated; in fact, it is not. To see this,
let us once more investigate the case $n = 3$. Here exist exactly four different subsets
$I \subseteq \{1, 2, 3\}$ with $\#(I) \ge 2$. These are $I = \{1, 2\}$, $I = \{1, 3\}$, $I = \{2, 3\}$ and $I = \{1, 2, 3\}$.
Consequently, three events $A_1$, $A_2$ and $A_3$ are independent if and only if the four following
conditions hold at once:

$$P(A_1 \cap A_2) = P(A_1) \cdot P(A_2)$$
$$P(A_1 \cap A_3) = P(A_1) \cdot P(A_3)$$
$$P(A_2 \cap A_3) = P(A_2) \cdot P(A_3) \quad \text{as well as}$$
$$P(A_1 \cap A_2 \cap A_3) = P(A_1) \cdot P(A_2) \cdot P(A_3).$$

Examples 2.2.9 and 2.2.10 show that all four equations are really necessary. None of
them is a consequence of the other three.
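Definition 2.2.12 translates directly into a loop over all index sets $I$ with $\#(I) \ge 2$. The sketch below, for finite sample spaces with the uniform distribution, applies such a check to the events of Examples 2.2.9 and 2.2.10. The function `independent` and its optional `sizes` parameter are assumptions made for this illustration; restricting `sizes` to `[2]` tests pairwise independence only, and `[3]` tests only the product formula for the triple.

```python
from fractions import Fraction
from itertools import combinations, product


def prob(event, omega) -> Fraction:
    """Uniform probability of a subset `event` of the finite set `omega`."""
    return Fraction(len(event), len(omega))


def independent(events, omega, sizes=None) -> bool:
    """Check eq. (2.18) for all index sets I whose size lies in `sizes` (default: 2..n)."""
    n = len(events)
    for k in (range(2, n + 1) if sizes is None else sizes):
        for idx in combinations(range(n), k):
            intersection = set(omega)
            rhs = Fraction(1)
            for i in idx:
                intersection &= events[i]
                rhs *= prob(events[i], omega)
            if prob(intersection, omega) != rhs:
                return False
    return True


# Example 2.2.9: two rolls of a die
dice = set(product(range(1, 7), repeat=2))
a1 = {w for w in dice if w[0] % 2 == 0}
a2 = {w for w in dice if w[1] % 2 == 1}
a3 = {w for w in dice if (w[0] + w[1]) % 2 == 1}   # one number even, the other odd
print(independent([a1, a2, a3], dice, sizes=[2]), independent([a1, a2, a3], dice))

# Example 2.2.10: uniform distribution on {1, ..., 12}
omega12 = set(range(1, 13))
b1, b2, b3 = set(range(1, 10)), {6, 7, 8, 9}, {9, 10, 11, 12}
print(independent([b1, b2, b3], omega12, sizes=[3]), independent([b1, b2, b3], omega12, sizes=[2]))
```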
Independence of n events possesses the following properties: 


Proposition 2.2.15.
1. Let $A_1, \ldots, A_n$ be independent. For any $J \subseteq \{1, \ldots, n\}$, the events $\{A_j : j \in J\}$ are
independent as well. In particular, independence implies pairwise independence.
2. For each permutation $\pi$ of $\{1, \ldots, n\}$, the independence of $A_1, \ldots, A_n$ implies that of²
$A_{\pi(1)}, \ldots, A_{\pi(n)}$.
3. Suppose for each $1 \le j \le n$ either $B_j = A_j$ or $B_j = A_j^c$ holds. Then the independence of
$A_1, \ldots, A_n$ implies that of $B_1, \ldots, B_n$.


Proof: The first two properties are an immediate consequence of the definition of
independence.

To prove the third assertion, reorder $A_1, \ldots, A_n$ such that³ $B_1 = A_1^c$. In a first step,
we show that $A_1^c, A_2, \ldots, A_n$ are independent as well, that is, we have $B_1 = A_1^c$, $B_2 = A_2$
and so on. Given $I \subseteq \{1, \ldots, n\}$, it has to hold

$$P\Big(\bigcap_{i \in I} B_i\Big) = \prod_{i \in I} P(B_i).$$

In the case $1 \notin I$, this follows by the independence of $A_1, \ldots, A_n$. If $1 \in I$, we apply
Proposition 2.2.7 with⁴ $A_1$ and $C = \bigcap_{i \in I \setminus \{1\}} A_i = \bigcap_{i \in I \setminus \{1\}} B_i$. Then $A_1^c = B_1$ and $C$ are
independent as well. Hence, by the independence of $A_2, \ldots, A_n$, we get

$$P\Big(\bigcap_{i \in I} B_i\Big) = P(B_1 \cap C) = P(B_1) \cdot P(C) = P(B_1) \cdot \prod_{i \in I \setminus \{1\}} P(B_i) = \prod_{i \in I} P(B_i).$$

² For example, in the case $n = 3$ with $A_1, A_2, A_3$ also $A_3, A_2, A_1$ or $A_2, A_3, A_1$ are independent.
³ If all $B_j = A_j$, there is nothing to prove.
⁴ Why are $A_1$ and $C$ independent? Give a short proof.

The general case then follows by reordering the $A_j$'s and by an iterative application
of the first step. This is exactly the procedure we did in the proof of Proposition 2.2.7
when verifying the independence of $A^c$ and $B^c$ for independent $A$ and $B$. □


The next two examples show how independence of more than two events appears in 
a natural way. 


Example 2.2.16. Toss a fair coin $n$ times. Let us assume that the coin is labeled with
“0” and “1.” Choose a fixed sequence $(a_j)_{j=1}^{n}$ of numbers in $\{0, 1\}$ and suppose that the
event $A_j$ occurs if in the $j$th trial $a_j$ comes up.

We claim now that $A_1, \ldots, A_n$ are independent. To verify this, choose a subset
$I \subseteq \{1, \ldots, n\}$ with $\#(I) = k$ for some $k = 2, \ldots, n$. The cardinality of $\bigcap_{i \in I} A_i$ equals $2^{n-k}$.
Why? At $k$ positions the values of the tosses are fixed; at $n - k$ positions, they still may
be either “0” or “1.” Consequently,

$$P\Big(\bigcap_{i \in I} A_i\Big) = \frac{2^{n-k}}{2^n} = 2^{-k}. \tag{2.19}$$

The same argument as before gives $\#(A_j) = 2^{n-1}$, hence $P(A_j) = 1/2$, $1 \le j \le n$.
Consequently, it follows

$$\prod_{i \in I} P(A_i) = \Big(\frac{1}{2}\Big)^{\#(I)} = 2^{-k}. \tag{2.20}$$

Combining eqs. (2.19) and (2.20) gives

$$P\Big(\bigcap_{i \in I} A_i\Big) = \prod_{i \in I} P(A_i),$$

and since $I$ was arbitrary, the sets $A_1, \ldots, A_n$ are independent.


Remark 2.2.17. Even the simple Example 2.2.16 shows that it might be rather
complicated to verify the independence of $n$ given events. For example, if we modify the
previous example by taking a biased coin, then the $A_j$'s remain independent, but the
proof becomes more complicated.
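For small $n$, the claim about the biased coin can at least be verified by brute force, without a general proof. The sketch below assumes that “1” comes up with probability $p$ in every toss and that an outcome $\omega \in \{0,1\}^n$ has probability $p^{k}(1-p)^{n-k}$, where $k$ is the number of ones in $\omega$; this model and the function name `biased_coin_independent` are assumptions made here, not statements from the text.

```python
from fractions import Fraction
from itertools import combinations, product


def biased_coin_independent(n: int, p: Fraction, a) -> bool:
    """Check eq. (2.18) for A_j = {toss j shows a[j]} under n tosses of a p-biased coin."""
    omega = list(product((0, 1), repeat=n))

    def weight(w) -> Fraction:
        ones = sum(w)
        return p**ones * (1 - p) ** (n - ones)   # assumed probability of outcome w

    def prob(pred) -> Fraction:
        return sum(weight(w) for w in omega if pred(w))

    marginals = [prob(lambda w, j=j: w[j] == a[j]) for j in range(n)]
    for k in range(2, n + 1):
        for idx in combinations(range(n), k):
            lhs = prob(lambda w, idx=idx: all(w[i] == a[i] for i in idx))
            rhs = Fraction(1)
            for i in idx:
                rhs *= marginals[i]
            if lhs != rhs:
                return False
    return True


print(biased_coin_independent(4, Fraction(1, 3), (1, 0, 1, 1)))
```

Such a check covers only the chosen $n$, $p$ and sequence $(a_j)$; it illustrates the statement of the remark but does not replace its proof.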


Example 2.2.18. A machine consists of $n$ components. These components break down
with certain probabilities $p_1, \ldots, p_n$. Moreover, we assume that they break down
independently of each other. Find the probability that a chosen machine stops working.
Before answering this question, we have to specify the conditions under which the
machine stops working.

Case 1: The machine stops working provided at least one component breaks down.
Let $M$ be the event that the machine stops working. If $j \le n$, assume $A_j$ occurs if
component $j$ breaks down. By assumption, $P(A_j) = p_j$. Since

$$M = \bigcup_{j=1}^{n} A_j,$$

by the independence⁵ it follows that

$$P(M) = 1 - P(M^c) = 1 - P\Big(\bigcap_{j=1}^{n} A_j^c\Big) = 1 - \prod_{j=1}^{n} P(A_j^c) = 1 - \prod_{j=1}^{n} (1 - p_j). \tag{2.21}$$

Case 2: The machine stops working provided all $n$ components break down.
Using the same notation as in case 1, we now have

$$M = \bigcap_{j=1}^{n} A_j.$$

Hence, by the independence we obtain

$$P(M) = P\Big(\bigcap_{j=1}^{n} A_j\Big) = \prod_{j=1}^{n} p_j. \tag{2.22}$$


Remark 2.2.19. Formula (2.21) tells us the following: If among the $n$ components there
is one of bad quality, say the component $j_0$, then $p_{j_0}$ is close to one; hence, $1 - p_{j_0}$ is
close to zero, and so is $\prod_{j=1}^{n} (1 - p_j)$. Because of eq. (2.21), $P(M)$ is large, and so the
machine breaks down with large probability.

In the second case, the conclusion is as follows: if among the $n$ components there
is one of high quality, say component $j_0$, then $p_{j_0}$ is small and so is $\prod_{j=1}^{n} p_j$. By eq. (2.22),
$P(M)$ is also small, hence it is very unlikely that the machine stops working.
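Formulas (2.21) and (2.22) are one-liners in code, which makes the effect described in Remark 2.2.19 easy to try out. In the sketch below the function names and the sample probabilities are illustrative assumptions.

```python
from math import prod


def p_stop_if_any_fails(p) -> float:
    """Eq. (2.21): the machine stops as soon as at least one component breaks down."""
    return 1 - prod(1 - pj for pj in p)


def p_stop_if_all_fail(p) -> float:
    """Eq. (2.22): the machine stops only if all components break down."""
    return prod(p)


# Two reliable components and one of bad quality (hypothetical values)
p = [0.01, 0.02, 0.9]
print(p_stop_if_any_fails(p))  # dominated by the bad component
print(p_stop_if_all_fail(p))   # kept small by the reliable components
```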


2.3 Problems 


Problem 2.1. The chance to win a certain game is 50%. One plays six games. Find the
probability of winning exactly four games. Evaluate the probability of this event under the
condition that at least two games are won. Suppose one has won exactly one of the first two
games. What is the probability of the event “winning four games” under this condition?


⁵ In fact, we also have to use Proposition 2.2.15.

Problem 2.2. Toss a fair coin six times. Define events $A$ and $B$ as follows:


$A$ = {“Head” appears exactly 3 times}
$B$ = {The first and the second toss are “head”}


Evaluate $P(A)$, $P(A|B)$ and $P(A|B^c)$.


Problem 2.3. Let A and B be as in Problem 1.24, that is, A occurs if each box contains 
at most one particle while B occurs if all four particles are in the same box. 
Find now P(A|C) and P(B|C) with C = {The first box remains empty}. 


Problem 2.4. Justify why Propositions 2.1.14 and 2.1.18 (Law of total probability and
Bayes’ formula) remain valid for infinitely many disjoint sets $B_1, B_2, \ldots$ satisfying
$P(B_j) > 0$ and $\bigcup_{j=1}^{\infty} B_j = \Omega$.

Prove that Proposition 2.1.14 also holds without assuming $\bigcup_{j=1}^{n} B_j = \Omega$. But then
we have to suppose $A \subseteq \bigcup_{j=1}^{n} B_j$.


Problem 2.5. To go to work, a man can either use the train, the bus or his car. He
chooses the train on 50%, the bus on 30% and the car on 20% of work days. If he takes the
train, he arrives on time with probability 0.95. By bus, he is on time with probability
0.8, and by car with probability 0.7.
1. Evaluate the probability that the man is at work on time.
2. How big is this probability given the man does not use the car?
3. Assume the man arrived at work on time. What are then the probabilities that he
came by train, bus or car?


Problem 2.6. Let $U_1$, $U_2$ and $U_3$ be three urns, each containing five balls. Urn $U_1$
contains four white balls and one black ball, $U_2$ has three white balls and two black balls
and, finally, $U_3$ contains two white balls and three black balls. Choose one urn at
random (each urn is equally likely) and take two balls out of the chosen urn without
replacing the first ball.
1. Give a suitable sample space for this random experiment.
2. Find the probability to observe two balls of different color.
3. Assume the chosen balls were of different color. What are the probabilities that
the balls were taken out of $U_1$, $U_2$ or $U_3$?


Problem 2.7. Suppose we have three nondistinguishable dice. Two of them are fair,
the other one is biased. For the biased die, the number “6” appears with probability 1/5
while all other numbers have probability 4/25. We choose at random one of the dice and
roll it.
1. Find a suitable sample space for the description of this experiment.
2. Give the probability of occurrence of {1} to {6} in that experiment.
3. Suppose we have observed the number “2” on the chosen die. Find the probability
that this die was the biased one.

Problem 2.8.
1. Let $(\Omega, \mathcal{A}, P)$ be a probability space. Given events $A_1, \ldots, A_n$ prove the following
chain rule for conditional probabilities:

$$P(A_1 \cap \cdots \cap A_n) = P(A_1)\, P(A_2|A_1) \cdots P(A_n | A_1 \cap A_2 \cap \cdots \cap A_{n-1}).$$

Here we assume that all conditional probabilities are well defined.
2. Choose at random three numbers out of 1 to 10 without replacement. Find the
probability that the first number is even, the second one is odd and the third one
is again even.
3. Compare this probability with that of the following event: among three randomly
chosen numbers in $\{1, \ldots, 10\}$ are exactly two even and one odd.


Problem 2.9. Three persons, say X, Y and Z, stand randomly in a row. All orderings
are assumed to be equally likely. Event $A$ occurs if Y stands to the right of X,
while $B$ occurs in the case that Z is to the right of X. Here we do not require
that Y and X, or Z and X, stand directly next to each other. Are events $A$
and $B$ independent or dependent?


Problem 2.10. Prove the following generalization of part 1 in Proposition 2.2.7. Let
$A \in \mathcal{A}$ be an event with either $P(A) = 0$ or $P(A) = 1$. Then for any $B \in \mathcal{A}$, the events $A$ and $B$
are independent.


Problem 2.11. Let $(\Omega, \mathcal{A}, P)$ be a probability space. Given independent events
$A_1, \ldots, A_n$ in $\mathcal{A}$ prove that

$$P\Big(\bigcup_{j=1}^{n} A_j\Big) = 1 - \prod_{j=1}^{n} \big(1 - P(A_j)\big). \tag{2.23}$$

Use $1 - x \le e^{-x}$, $x \ge 0$, to derive from eq. (2.23) the following:
If independent events⁶ $A_1, A_2, \ldots$ satisfy $\sum_{j=1}^{\infty} P(A_j) = \infty$, then $P\big(\bigcup_{j=1}^{\infty} A_j\big) = 1$.


⁶ … $A_n$ are independent.

Problem 2.12. An electric circuit (see the above figure) contains four switches A, B, C
and D. Each of the switches is independently open or closed (then electricity flows).
The switches are open with probability $1 - p$ and closed with probability $p$. Here,
$0 < p < 1$ is given. Find the probability that electricity flows from the left-hand side to the
right-hand one.


Problem 2.13. Let $(\Omega, \mathcal{A}, P)$ be a probability space. Suppose $A$ and $B$ are disjoint events
with $P(A) > 0$ and $P(B) > 0$. Is it possible that $A$ and $B$ are independent?


Problem 2.14. Let $A$, $B$ and $C$ be three independent events.
1. Show that $A \cap B$ and $C$ are independent as well.
2. Even more, show that the independence of $A$, $B$ and $C$ implies that of $A \cup B$ and $C$.


Problem 2.15.
1. Suppose that $A$ and $C$ as well as $B$ and $C$ are independent. Furthermore, assume
$A \cap B = \emptyset$. Show that $A \cup B$ and $C$ are independent as well.
2. Give an example that shows that the preceding assertion becomes false without
the assumption $A \cap B = \emptyset$.

Remark: To construct such an example, because of Problem 2.14, the events $A$, $B$ and
$C$ cannot be chosen to be independent. Therefore, the sets defined in Example 2.2.9
are natural candidates for such an example.


Problem 2.16. Suppose $P(A|B) = P(A|B^c)$ for some events $A$ and $B$ with $0 < P(B) < 1$.
Does this imply that $A$ and $B$ are independent?


Problem 2.17. Is it possible that an event $A$ is independent of itself? If yes, which
events $A$ have this property? Similarly, which $A$ are independent of $A^c$?


Problem 2.18. Let $A$, $B$ and $C$ be three independent events with

$$P(A) = P(B) = P(C) = \frac{1}{3}.$$

Evaluate

$$P\big((A \cap B) \cup (A \cap C)\big).$$


3 Random Variables and Their Distribution 


3.1 Transformation of Random Values 


Assume the probability space $(\Omega, \mathcal{A}, P)$ describes a certain random experiment, for
example, rolling a die or tossing a coin. If the experiment is executed, a random result
$\omega \in \Omega$ shows up. In a second step we transform this observed result via a mapping
$X : \Omega \to \mathbb{R}$. In this way we obtain a (random) real number $X(\omega)$. Let us point out that $X$
is a fixed, nonrandom function from $\Omega$ into $\mathbb{R}$; the randomness of $X(\omega)$ stems from the
input $\omega \in \Omega$.


Example 3.1.1. Toss a fair coin, labeled on one side with “0” and on the other side
with “1”, exactly $n$ times. The appropriate probability space is $(\Omega, \mathcal{P}(\Omega), P)$, where
$\Omega = \{0, 1\}^n$ and $P$ is the uniform distribution on $\Omega$. The result of the experiment is a vector
$\omega = (\omega_1, \ldots, \omega_n)$ with $\omega_j = 0$ or $\omega_j = 1$. Let $X : \Omega \to \mathbb{R}$ be defined by

$$X(\omega) = X(\omega_1, \ldots, \omega_n) = \omega_1 + \cdots + \omega_n.$$

Then $X(\omega)$ tells us how often “1” occurred, but we no longer know in which order
this happened. Of course, $X(\omega)$ is random because, if one tosses the coin another $n$
times, it is very likely that $X$ attains a value different from that in the first trial.

Here we state the most important question in this topic: how are the values of $X$
distributed? In our case $X$ attains the values $k \le n$ with probabilities $\binom{n}{k} 2^{-n}$.
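The probabilities $\binom{n}{k} 2^{-n}$ can be confirmed for small $n$ by listing all outcomes in $\{0,1\}^n$ and counting how often each value of $X$ occurs. The following Python sketch does so; the function name `distribution_of_sum` is an assumption made for illustration.

```python
from collections import Counter
from fractions import Fraction
from itertools import product
from math import comb


def distribution_of_sum(n: int):
    """Return {k: P{X = k}} for X(w) = w_1 + ... + w_n under the uniform distribution on {0,1}^n."""
    counts = Counter(sum(w) for w in product((0, 1), repeat=n))
    return {k: Fraction(counts[k], 2**n) for k in range(n + 1)}


n = 5
pmf = distribution_of_sum(n)
for k in range(n + 1):
    # enumerated probability next to the closed form binom(n, k) * 2^(-n)
    print(k, pmf[k], Fraction(comb(n, k), 2**n))
```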


Example 3.1.2. Roll a fair die twice. The sample space describing this experiment
consists of pairs $\omega = (\omega_1, \omega_2)$, where $\omega_1, \omega_2 \in \{1, \ldots, 6\}$. Now define the mapping
$X : \Omega \to \mathbb{R}$ by $X(\omega) := \max\{\omega_1, \omega_2\}$. Thus, instead of recording the values of both rolls,
we are only interested in the larger one.

Other possible transformations are, for example, $X_1(\omega) := \min\{\omega_1, \omega_2\}$ or also
$X_2(\omega_1, \omega_2) := \omega_1 + \omega_2$.


Let $A \in \mathcal{A}$ be an event. Recall that this event $A$ occurs if and only if we observe an
$\omega \in A$. Suppose now $X : \Omega \to \mathbb{R}$ is a given mapping from $\Omega$ into $\mathbb{R}$ and let $B \subseteq \mathbb{R}$ be
some event. When do we observe an $\omega \in \Omega$ for which we have $X(\omega) \in B$ or, equivalently,
when does the event

$$\{X \in B\} := \{\omega \in \Omega : X(\omega) \in B\}$$

occur? To answer this question, let us recall the definition of the preimage of $B$ with
respect to $X$ as given in eq. (A.1):

$$X^{-1}(B) := \{\omega \in \Omega : X(\omega) \in B\}.$$

We observe an $\omega \in \Omega$ for which $X(\omega) \in B$ if and only if $\omega \in X^{-1}(B)$. In other words,
the event $\{X \in B\}$ occurs if and only if $X^{-1}(B)$ does so. Consequently, the probability to
observe an $\omega \in \Omega$ with $X(\omega) \in B$ should be $P(X^{-1}(B))$. But to this end, we have to know
that $X^{-1}(B) \in \mathcal{A}$; otherwise $P(X^{-1}(B))$ is not defined at all. Thus, a natural condition
for $X$ is $X^{-1}(B) \in \mathcal{A}$ for “sufficiently many” subsets $B \subseteq \mathbb{R}$. The precise mathematical
condition reads as follows.


Definition 3.1.3. Let $(\Omega, \mathcal{A}, P)$ be a probability space. A mapping $X : \Omega \to \mathbb{R}$
is called a (real-valued) random variable (sometimes also called random real
number), provided that it satisfies the following condition:

$$B \in \mathcal{B}(\mathbb{R}) \;\Longrightarrow\; X^{-1}(B) \in \mathcal{A}. \tag{3.1}$$

Verbally, this condition says that for each Borel set $B \subseteq \mathbb{R}$, its preimage $X^{-1}(B)$ has
to be an element of the $\sigma$-field $\mathcal{A}$.


Remark 3.1.4. Condition (3.1) is purely technical and will not be important later on.
But, in general, it cannot be avoided, at least if $\mathcal{A} \ne \mathcal{P}(\Omega)$. On the contrary, if $\mathcal{A} = \mathcal{P}(\Omega)$,
for example, if $\Omega$ is either finite or countably infinite, then every mapping $X : \Omega \to \mathbb{R}$
is a random variable. Indeed, in this case the condition $X^{-1}(B) \in \mathcal{A}$ is trivially
always satisfied.


Remark 3.1.5. In order to verify that a given mapping $X : \Omega \to \mathbb{R}$ is a random variable,
it is not necessary to show $X^{-1}(B) \in \mathcal{A}$ for all Borel sets $B \subseteq \mathbb{R}$. It suffices to prove this
only for some special Borel sets $B$. More precisely, the following proposition holds.


Proposition 3.1.6. A function $X : \Omega \to \mathbb{R}$ is a random variable if and only if, for all $t \in \mathbb{R}$,
we have

$$X^{-1}((-\infty, t]) = \{\omega \in \Omega : X(\omega) \le t\} \in \mathcal{A}. \tag{3.2}$$

The assertion remains valid when we replace the intervals $(-\infty, t]$ with intervals of the
form $(-\infty, t)$, or we may take intervals $[t, \infty)$ and also $(t, \infty)$.


Proof: Suppose first that $X$ is a random variable. Given $t \in \mathbb{R}$, the interval $(-\infty, t]$ is a
Borel set, hence $X^{-1}((-\infty, t]) \in \mathcal{A}$. Thus, each random variable satisfies condition (3.2).

To prove the converse implication, let $X$ be a mapping from $\Omega$ to $\mathbb{R}$ satisfying
condition (3.2) for each $t \in \mathbb{R}$. Set

$$\mathcal{C} := \{C \in \mathcal{B}(\mathbb{R}) : X^{-1}(C) \in \mathcal{A}\}.$$

In a first step, one proves¹ that $\mathcal{C}$ is a $\sigma$-field. Moreover, property (3.2) implies
$(-\infty, t] \in \mathcal{C}$ for each $t \in \mathbb{R}$. But $\mathcal{B}(\mathbb{R})$ is the smallest $\sigma$-field containing all these
intervals. Since $\mathcal{C}$ is another $\sigma$-field containing the intervals $(-\infty, t]$, it has to be larger² than
the smallest one, that is, we have $\mathcal{C} \supseteq \mathcal{B}(\mathbb{R})$. In other words, every Borel set belongs to $\mathcal{C}$ or,
equivalently, for all $B \in \mathcal{B}(\mathbb{R})$ it follows $X^{-1}(B) \in \mathcal{A}$. Thus, as asserted, $X$ is a random variable.
The proof for intervals of the other types goes along the same line. Here one has to use
that these intervals generate the $\sigma$-field of Borel sets as well. □


3.2 Probability Distribution of a Random Variable 


Suppose we are given a random variable $X : \Omega \to \mathbb{R}$. We define now a mapping $P_X$
from $\mathcal{B}(\mathbb{R})$ to $[0, 1]$ as follows:

$$P_X(B) := P(X^{-1}(B)) = P\{\omega \in \Omega : X(\omega) \in B\}, \quad B \in \mathcal{B}(\mathbb{R}).$$

Observe that $P_X$ is well-defined. Indeed, since $X$ is a random variable, for all Borel sets
$B \subseteq \mathbb{R}$ we have $X^{-1}(B) \in \mathcal{A}$, hence $P(X^{-1}(B))$ makes sense.
To simplify the notation, given $B \in \mathcal{B}(\mathbb{R})$, we will often write

$$P\{X \in B\} = P\{\omega \in \Omega : X(\omega) \in B\}.$$

This notation is generally used and does not lead to any confusion. Having said this, we may
now define $P_X$ also by

$$P_X(B) = P\{X \in B\}.$$

A first easy example shows how $P_X$ is calculated in concrete cases. Other, more
interesting examples will follow after some necessary preliminary considerations.

Example 3.2.1. Toss a fair coin, labeled on one side by “0” and on the other side
by “1,” three times. The sample space is $\Omega = \{0, 1\}^3$ with the uniform distribution $P$
as describing probability measure. Let the random variable $X$ on $\Omega$ be defined by

$$X(\omega) := \omega_1 + \omega_2 + \omega_3 \quad \text{whenever } \omega = (\omega_1, \omega_2, \omega_3) \in \Omega.$$


1 Use Proposition A.2.1 to verify this. 
2 By the construction of C, it even coincides with B(R). 
It follows that

$$\begin{aligned}
P_X(\{0\}) &= P\{X = 0\} = P(\{(0,0,0)\}) = \tfrac{1}{8},\\
P_X(\{1\}) &= P\{X = 1\} = P(\{(1,0,0), (0,1,0), (0,0,1)\}) = \tfrac{3}{8},\\
P_X(\{2\}) &= P\{X = 2\} = P(\{(1,1,0), (0,1,1), (1,0,1)\}) = \tfrac{3}{8},\\
P_X(\{3\}) &= P\{X = 3\} = P(\{(1,1,1)\}) = \tfrac{1}{8}.
\end{aligned}$$

Of course, these values describe the distribution of $X$ completely. Indeed, whenever $B \subseteq \mathbb{R}$, then

$$P_X(B) = \sum_{\substack{k=0 \\ k \in B}}^{3} P_X(\{k\}).$$
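
The calculation in Example 3.2.1 can be reproduced by brute-force enumeration. The following is a minimal Python sketch and not part of the original text; the names `omega`, `dist_X`, and `P_X` are chosen here for illustration only.

```python
from collections import Counter
from fractions import Fraction
from itertools import product

# Sample space {0,1}^3 with the uniform distribution: every outcome has probability 1/8.
omega = list(product((0, 1), repeat=3))
p_outcome = Fraction(1, len(omega))

# X(w) = w1 + w2 + w3; P_X({k}) is the probability of the preimage of k.
counts = Counter(sum(w) for w in omega)
dist_X = {k: c * p_outcome for k, c in sorted(counts.items())}


def P_X(B):
    """P_X(B) as the sum of P_X({k}) over all attained values k lying in B."""
    return sum((p for k, p in dist_X.items() if k in B), Fraction(0))


print(dist_X)        # the four values 1/8, 3/8, 3/8, 1/8
print(P_X({0, 1}))   # probability of at most one "1"
```

Exact fractions are used so that the output can be compared directly with the values stated in the example.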


The proof of the next result heavily depends on properties of the preimage proved in 
Proposition A.2.1. 


Proposition 3.2.2. Let $(\Omega, \mathcal{A}, P)$ be a probability space. For each random variable $X : \Omega \to \mathbb{R}$ the mapping $P_X : \mathcal{B}(\mathbb{R}) \to [0,1]$ is a probability measure.


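Proof: Using property (1) in Proposition A.2.1 one easily gets

$$P_X(\emptyset) = P(X^{-1}(\emptyset)) = P(\emptyset) = 0$$

as well as

$$P_X(\mathbb{R}) = P(X^{-1}(\mathbb{R})) = P(\Omega) = 1.$$

Thus it remains to verify the $\sigma$-additivity of $P_X$. Take any sequence of disjoint Borel sets $B_1, B_2, \ldots$ in $\mathbb{R}$. Then $X^{-1}(B_1), X^{-1}(B_2), \ldots$ are also disjoint subsets of $\Omega$. To see this apply Proposition A.2.1, which, if $i \neq j$, implies

$$X^{-1}(B_i) \cap X^{-1}(B_j) = X^{-1}(B_i \cap B_j) = X^{-1}(\emptyset) = \emptyset.$$

Another application of Proposition A.2.1 and of the $\sigma$-additivity of $P$ finally gives

$$P_X\Big(\bigcup_{j=1}^{\infty} B_j\Big) = P\Big(X^{-1}\Big(\bigcup_{j=1}^{\infty} B_j\Big)\Big) = P\Big(\bigcup_{j=1}^{\infty} X^{-1}(B_j)\Big) = \sum_{j=1}^{\infty} P(X^{-1}(B_j)) = \sum_{j=1}^{\infty} P_X(B_j).$$

Hence, $P_X$ is a probability measure as asserted. ∎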
Definition 3.2.3. The probability measure $P_X$ on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$ defined by

$$P_X(B) := P(X^{-1}(B)) = P\{\omega \in \Omega : X(\omega) \in B\} = P\{X \in B\}, \quad B \in \mathcal{B}(\mathbb{R}),$$

is called the probability distribution of $X$ (with respect to $P$) or, in short, the distribution of $X$.


Remark 3.2.4. The distribution $P_X$ is the most important characteristic of a random
variable X. In general, it is completely unimportant how a random variable is defined 
analytically; only its distribution matters. Thus, two random variables with identical 
distributions may be regarded as equivalent because they describe the same random 
experiment. 


Remark 3.2.4 leads us to the following definition: 


Definition 3.2.5. Two random variables $X_1$ and $X_2$ are said to be identically distributed provided that $P_{X_1} = P_{X_2}$. Hereby, it is not necessary that $X_1$ and $X_2$ are defined on the same sample space. Only their distributions have to coincide. In the case of identically distributed $X_1$ and $X_2$ one writes $X_1 \stackrel{d}{=} X_2$.


Example 3.2.6. Toss a fair coin, labeled on one side by “0” and on the other side by “1,” twice. Let $X_1$ be the value of the first toss and $X_2$ that of the second one. Then

$$P\{X_1 = 0\} = P\{X_2 = 0\} = \tfrac{1}{2} \quad \text{and} \quad P\{X_1 = 1\} = P\{X_2 = 1\} = \tfrac{1}{2}.$$

Hence, $X_1$ and $X_2$ are identically distributed, or $X_1 \stackrel{d}{=} X_2$. Both random variables describe the same experiment, namely to toss a fair coin one time. Now, toss the coin a third time and let $X_3$ be the result of the third trial. Then we also have $X_1 \stackrel{d}{=} X_3$, but note that $X_1$ and $X_3$ are defined on different sample spaces.


Next, we state and prove some general rules for evaluating the probability distribution 
of a given random variable. Here we have to distinguish between two different types 
of random variables, namely between discrete and continuous ones. Let us start with 
the discrete case. 


Definition 3.2.7. A random variable $X$ is discrete provided there exists an at most countably infinite set $D \subset \mathbb{R}$ such that $X : \Omega \to D$.

In other words, a random variable is discrete if it attains at most countably infinitely many different values.
Remark 3.2.8. If a random variable $X$ is discrete with values in $D \subset \mathbb{R}$, then, of course,

$$P_X(D) = P\{X \in D\} = 1.$$

Consequently, in this case its probability distribution $P_X$ is a discrete probability measure on $\mathbb{R}$. In general, the converse is not valid, as the next example shows.


Example 3.2.9. We model the experiment of rolling a fair die by the probability space $(\mathbb{R}, \mathcal{P}(\mathbb{R}), P)$, where $P(\{1\}) = \cdots = P(\{6\}) = 1/6$ and $P(\{x\}) = 0$ whenever $x \notin \{1, \ldots, 6\}$. If $X : \mathbb{R} \to \mathbb{R}$ is defined by $X(s) = s^2$ then, of course, $P_X$ is discrete. Indeed, we have $P_X(D) = 1$, where $D = \{1, 4, 9, 16, 25, 36\}$. On the other hand, $X$ does not attain values in an at most countably infinite set; its range is $[0, \infty)$.


Remark 3.2.10. If we look at Example 3.2.9 more thoroughly, then it becomes immediately clear that the values of $X$ outside of $\{1, \ldots, 6\}$ are completely irrelevant. With a small change of $X$, it will attain values in $D$. More precisely, let $\tilde{X}(\omega) = 1$ if $\omega \notin \{1, \ldots, 6\}$ and $\tilde{X}(k) = k^2$, $k = 1, \ldots, 6$; then $\tilde{X} \stackrel{d}{=} X$ and $\tilde{X}$ has values in $\{1, 4, 9, 16, 25, 36\}$.

This procedure is also possible in general: if $P_X$ is discrete with $P_X(D) = 1$ for some countable set $D$, then we may change $X$ to $\tilde{X}$ such that $\tilde{X} \stackrel{d}{=} X$ and $\tilde{X} : \Omega \to D$. Indeed, choose some fixed $d_0 \in D$ and set $\tilde{X}(\omega) = X(\omega)$ if $\omega \in X^{-1}(D)$ and $\tilde{X}(\omega) = d_0$ otherwise. Then $P_X = P_{\tilde{X}}$ and $\tilde{X}$ has values in $D$.


Convention 3.1. Without losing generality we may always assume the following: if a random variable $X$ has a discrete probability distribution, that is, $P\{X \in D\} = 1$ for some finite or countably infinite set $D$, then $X$ attains values in $D$.


The second type of random variables we investigate is that of continuous ones.³


Definition 3.2.11. A random variable $X$ is said to be continuous provided that its distribution $P_X$ is a continuous probability measure. That is, $P_X$ possesses a density $p$. This function $p$ is called the density function or, in short, density of the random variable $X$.


Remark 3.2.12. One should not confuse the continuity of a random variable with the 
continuity of a function as taught in Calculus. The latter is an (analytic) property of a 
function, while the former is a property of its distribution. Moreover, whether or not 


3 The precise notation would be “absolutely continuous”; but for simplicity let us call them 
“continuous.” 
a random variable X is continuous depends not only on X, but also on the underlying
probability space.


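Remark 3.2.13. Another way to express that a random variable is continuous is as follows: there exists a function $p : \mathbb{R} \to [0, \infty)$ (the density of $X$) such that

$$P\{\omega \in \Omega : X(\omega) \le t\} = P\{X \le t\} = \int_{-\infty}^{t} p(x)\,dx, \quad t \in \mathbb{R},$$

or, equivalently, for all real numbers $a < b$,

$$P\{\omega \in \Omega : a \le X(\omega) \le b\} = P\{a \le X \le b\} = \int_{a}^{b} p(x)\,dx.$$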


How do we determine the probability distribution of a given random variable? To 
answer this question, let us first consider the case of discrete random variables. 

Thus, let $X$ be discrete with values in $D = \{x_1, x_2, \ldots\} \subset \mathbb{R}$. Then, as observed above, it follows that $P_X(D) = 1$, and, consequently, $P_X$ is uniquely determined by the numbers

$$p_j := P_X(\{x_j\}) = P\{X = x_j\} = P\{\omega \in \Omega : X(\omega) = x_j\}, \quad j = 1, 2, \ldots. \tag{3.3}$$

Moreover, for any $B \subseteq \mathbb{R}$ it follows that

$$P\{\omega \in \Omega : X(\omega) \in B\} = P_X(B) = \sum_{x_j \in B} p_j.$$

Consequently, in order to determine $P_X$ for discrete $X$ it completely suffices to determine the $p_j$s defined by eq. (3.3). If we know $(p_j)_{j \ge 1}$, then the probability distribution $P_X$ of $X$ is completely described.


Remark 3.2.14. In the literature, quite often, one finds a slightly different approach for the description of $P_X$. Define $p : \mathbb{R} \to [0, 1]$ by

$$p(x) := P\{X = x\}, \quad x \in \mathbb{R}. \tag{3.4}$$

This function $p$ is then called the probability mass function of $X$. Note that $p(x) = 0$ whenever $x \notin D$. This function $p$ satisfies $p(x) \ge 0$, $\sum_{x \in D} p(x) = 1$, and

$$P\{X \in B\} = \sum_{x \in B} p(x).$$

In this setting, the numbers $p_j$ in eq. (3.3) coincide with $p(x_j)$.
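Example 3.2.15. Roll a fair die twice. Let $X$ on $\Omega = \{1, \ldots, 6\}^2$ be defined by

$$X(\omega) = X(\omega_1, \omega_2) := \omega_1 + \omega_2, \quad \omega = (\omega_1, \omega_2).$$

Which distribution does $X$ possess?

Answer: The very first question one has to answer is always about the possible values of $X$. In our case, $X$ attains values in $D = \{2, \ldots, 12\}$, thus it suffices to determine

$$P_X(\{k\}) = P\{X = k\} = P\{(\omega_1, \omega_2) \in \Omega : X(\omega_1, \omega_2) = k\}, \quad k = 2, \ldots, 12.$$

One easily gets

$$\begin{aligned}
P_X(\{2\}) &= P\{(\omega_1, \omega_2) : \omega_1 + \omega_2 = 2\} = \frac{\#(\{(1,1)\})}{36} = \frac{1}{36},\\
P_X(\{3\}) &= P\{(\omega_1, \omega_2) : \omega_1 + \omega_2 = 3\} = \frac{\#(\{(1,2), (2,1)\})}{36} = \frac{2}{36},\\
&\ \ \vdots\\
P_X(\{7\}) &= P\{(\omega_1, \omega_2) : \omega_1 + \omega_2 = 7\} = \frac{\#(\{(1,6), \ldots, (6,1)\})}{36} = \frac{6}{36},\\
&\ \ \vdots\\
P_X(\{12\}) &= P\{(\omega_1, \omega_2) : \omega_1 + \omega_2 = 12\} = \frac{\#(\{(6,6)\})}{36} = \frac{1}{36},
\end{aligned}$$

hence $P_X$ is completely described. For example, it follows that

$$P\{X \le 4\} = P_X((-\infty, 4]) = P_X(\{2\}) + P_X(\{3\}) + P_X(\{4\}) = \frac{1}{36} + \frac{2}{36} + \frac{3}{36} = \frac{1}{6}.$$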
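
The counting argument of Example 3.2.15 can be checked by enumerating all 36 outcomes. This is a minimal Python sketch, not taken from the original text; `pmf_sum` and `dist` are hypothetical helper names.

```python
from fractions import Fraction
from itertools import product

# All 36 equally likely outcomes (w1, w2) of rolling a fair die twice.
omega = list(product(range(1, 7), repeat=2))


def pmf_sum(k):
    """P{X = k} for X(w1, w2) = w1 + w2: favourable outcomes divided by 36."""
    favourable = sum(1 for w1, w2 in omega if w1 + w2 == k)
    return Fraction(favourable, len(omega))


# The possible values of X are D = {2, ..., 12}.
dist = {k: pmf_sum(k) for k in range(2, 13)}

# P{X <= 4} = P_X({2}) + P_X({3}) + P_X({4})
prob_le_4 = sum(dist[k] for k in (2, 3, 4))
print(dist[7], prob_le_4)
```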


Example 3.2.16. A coin is labeled on one side by “0” and on the other side by “1” and biased as follows: for some $p \in [0,1]$, number “1” shows up with probability $p$, thus “0” with probability $1 - p$. We toss the coin $n$ times. The result is a sequence $\omega = (\omega_1, \ldots, \omega_n)$, where $\omega_i \in \{0, 1\}$, hence the describing sample space is

$$\Omega = \{0, 1\}^n = \{\omega = (\omega_1, \ldots, \omega_n) : \omega_i \in \{0, 1\}\}.$$

For $i \le n$ let $X_i : \Omega \to \mathbb{R}$ be defined by $X_i(\omega) := \omega_i$. That is, $X_i(\omega)$ is the value of the $i$th trial. What distribution does $X_i$ possess?

Answer: In Example 1.9.11 we determined the probability measure $P$ on $\mathcal{P}(\Omega)$, which describes the $n$-fold tossing of a biased coin. This probability measure was given by

$$P(\{\omega\}) = p^k (1-p)^{n-k}, \quad k = \sum_{j=1}^{n} \omega_j, \quad \text{where } \omega = (\omega_1, \ldots, \omega_n). \tag{3.5}$$

The random variable $X_i$ only attains the values “0” and “1.” Thus, in order to determine $P_{X_i}$ it suffices to evaluate $P_{X_i}(\{0\}) = P\{\omega \in \Omega : \omega_i = 0\}$. Let $\omega \in \Omega$ be a sequence with $\omega_i = 0$. Then it may contain the value “1” at most $n-1$ times. Given $k \le n-1$, there are exactly $\binom{n-1}{k}$ such sequences $\omega$ with $\omega_i = 0$ and with $k$ times “1.” Therefore, we obtain

$$\begin{aligned}
P_{X_i}(\{0\}) = P\{\omega \in \Omega : \omega_i = 0\} &= \sum_{k=0}^{n-1} P\{\omega \in \Omega : \omega_i = 0,\ \omega_1 + \cdots + \omega_n = k\}\\
&= \sum_{k=0}^{n-1} \binom{n-1}{k} p^k (1-p)^{n-k} = (1-p) \sum_{k=0}^{n-1} \binom{n-1}{k} p^k (1-p)^{n-1-k}\\
&= (1-p)\,[p + (1-p)]^{n-1} = 1 - p.
\end{aligned}$$

Of course, this also implies $P_{X_i}(\{1\}) = p$.
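
As a sanity check of Example 3.2.16, one may sum the point masses from eq. (3.5) over all sequences with a zero at position $i$. The following Python sketch is illustrative only; the function name `prob_toss_is_zero` and the test values $n = 5$, $p = 1/3$ are assumptions made here, not part of the original.

```python
from fractions import Fraction
from itertools import product


def prob_toss_is_zero(n, p, i):
    """P{X_i = 0}: sum of P({w}) = p^k (1-p)^(n-k) over all w in {0,1}^n with w_i = 0."""
    total = Fraction(0)
    for w in product((0, 1), repeat=n):
        if w[i - 1] == 0:          # positions are numbered 1, ..., n as in the text
            k = sum(w)             # number of "1"s in the sequence
            total += p**k * (1 - p) ** (n - k)
    return total


p = Fraction(1, 3)
print([prob_toss_is_zero(5, p, i) for i in range(1, 6)])  # each entry equals 1 - p
```

The result does not depend on $i$, in line with the closed-form value $1 - p$ derived above.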


Remark 3.2.17. Note that all $X_1, \ldots, X_n$ possess the same distribution, that is, $X_1 \stackrel{d}{=} X_2 \stackrel{d}{=} \cdots \stackrel{d}{=} X_n$.


Summary: Let $X : \Omega \to \mathbb{R}$ be a discrete random variable. In order to describe its distribution $P_X$, two things have to be done:

(1) Determine the finite or countably infinite set $D \subset \mathbb{R}$ for which $X : \Omega \to D$.

(2) For each $x \in D$ evaluate

$$P_X(\{x\}) = P\{X = x\} = P\{\omega \in \Omega : X(\omega) = x\}.$$

If $B \subseteq \mathbb{R}$, then it follows that

$$P\{X \in B\} = \sum_{x \in B \cap D} P_X(\{x\}) = \sum_{x \in B \cap D} P\{X = x\}.$$


How do we determine the probability distribution of a random variable if it is continuous? For each $x \in \mathbb{R}$, $P\{X = x\} = 0$, hence the values of $P\{X = x\}$ cannot be used to describe $P_X$ as they did in the discrete case. Consequently, a different approach is needed, and this approach is based on the use of distribution functions.


Definition 3.2.18. Let $X$ be a random variable, either discrete or continuous. Then its (cumulative) distribution function $F_X : \mathbb{R} \to [0,1]$ is defined by

$$F_X(t) := P_X((-\infty, t]) = P\{X \le t\}, \quad t \in \mathbb{R}. \tag{3.6}$$
Remark 3.2.19. Observe that for discrete and continuous random variables the distribution function equals

$$F_X(t) = \sum_{x_j \le t} p_j \quad \text{and} \quad F_X(t) = \int_{-\infty}^{t} p(x)\,dx,$$

respectively. Here, in the discrete case, the $x_j$s and $p_j$s are as in eq. (3.3), while $p$ denotes the density of $X$ in the continuous case.
Furthermore, note that $F_X$ is nothing else than the distribution function of the probability measure $P_X$, as it was introduced in Definition 1.7.1. Consequently, it possesses all properties of a “usual” distribution function as stated in Proposition 1.7.9.


Proposition 3.2.20. Let $F_X$ be defined by eq. (3.6). Then it possesses the following properties.

(1) $F_X$ is nondecreasing.

(2) It follows that $F_X(-\infty) = 0$ as well as $F_X(\infty) = 1$.

(3) $F_X$ is continuous from the right.

Furthermore, if $t \in \mathbb{R}$, then

$$P\{X = t\} = F_X(t) - F_X(t - 0).$$

In particular, if $X$ is continuous, then $F_X$ is a continuous function from $\mathbb{R}$ to $[0, 1]$.
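
For a discrete random variable the jump formula $P\{X = t\} = F_X(t) - F_X(t - 0)$ can be observed directly. Below is a small Python sketch (an illustration, not from the original) using the distribution of the three-coin sum from Example 3.2.1; the names `F` and `F_left` are assumptions.

```python
from fractions import Fraction

# Distribution of the sum of three fair 0/1 coin tosses (Example 3.2.1).
pmf = {0: Fraction(1, 8), 1: Fraction(3, 8), 2: Fraction(3, 8), 3: Fraction(1, 8)}


def F(t):
    """Distribution function F_X(t) = P{X <= t}."""
    return sum((p for x, p in pmf.items() if x <= t), Fraction(0))


def F_left(t):
    """Left-hand limit F_X(t - 0) = P{X < t}."""
    return sum((p for x, p in pmf.items() if x < t), Fraction(0))


print(F(1) - F_left(1))      # the jump at t = 1, equal to P{X = 1}
print(F(1.5) - F_left(1.5))  # no jump between attained values
```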


Remark 3.2.21. Note that the converse of the last implication does not hold. Indeed, there exist random variables $X$ for which $F_X$ is continuous, but $X$ does not possess a density. Such random variables are said to be singularly continuous. These are exactly those random variables for which the probability measure $P_X$ is singularly continuous in the sense of Remark 1.7.16.


The next result shows that under slightly stronger conditions on $F_X$ a density of $X$ exists.


Proposition 3.2.22. Let $F_X$ be continuous and continuously differentiable with the exception of at most finitely many points. Then $X$ is continuous with density $p(t) = \frac{d}{dt} F_X(t)$. Hereby the values of $p$ may be chosen arbitrarily at points where the derivative does not exist; for example, set $p(t) = 0$ for those points.


Proof: The proof follows from the corresponding properties of distribution functions for probability measures. Recall that $F_X$ is the distribution function of $P_X$. ∎
The previous proposition provides us with a method to determine the density of a given random variable $X$. Determine the distribution function $F_X$ and differentiate it. The obtained derivative is the density function we are looking for.

The next three examples demonstrate how this method applies. 


Example 3.2.23. Let $P$ be the uniform distribution on the disk $K \subset \mathbb{R}^2$ of radius 1 centered at the origin. That is, for each Borel set $B \in \mathcal{B}(\mathbb{R}^2)$ we have

$$P(B) = \frac{\mathrm{vol}_2(B \cap K)}{\mathrm{vol}_2(K)} = \frac{\mathrm{vol}_2(B \cap K)}{\pi}.$$

Define the random variable $X : \mathbb{R}^2 \to \mathbb{R}$ by $X(x_1, x_2) := x_1$. Of course, we have $F_X(t) = 0$ whenever $t < -1$ and $F_X(t) = 1$ when $t > 1$. Thus, it suffices to determine $F_X(t)$ if $-1 \le t \le 1$. For those $t \in \mathbb{R}$ we obtain

$$F_X(t) = \frac{\mathrm{vol}_2(S_t \cap K)}{\pi},$$

where $S_t$ is the half-space $\{(x_1, x_2) \in \mathbb{R}^2 : x_1 \le t\}$.


Figure 3.1: The intersecting set between $K$ and the half-space $S_t$.


If $|t| \le 1$, then

$$\mathrm{vol}_2(S_t \cap K) = 2 \int_{-1}^{t} \sqrt{1 - x^2}\,dx,$$

hence,

$$F_X(t) = \frac{2}{\pi} \int_{-1}^{t} \sqrt{1 - x^2}\,dx,$$

and by the fundamental theorem of Calculus, we finally get

$$p(t) = \frac{d}{dt} F_X(t) = \frac{2}{\pi} \sqrt{1 - t^2}, \quad |t| \le 1.$$

Summing up, the random variable $X$ has the density $p$ with

$$p(t) = \begin{cases} \frac{2}{\pi} \sqrt{1 - t^2} & : |t| \le 1\\ 0 & : |t| > 1. \end{cases} \tag{3.7}$$


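Example 3.2.24. The probability space is the same as in Example 3.2.23, but this time we define $X$ by

$$X(x_1, x_2) := \sqrt{x_1^2 + x_2^2}, \quad (x_1, x_2) \in \mathbb{R}^2.$$

Of course, it follows that $F_X(t) = 0$ if $t < 0$ while $F_X(t) = 1$ if $t > 1$. Take $t \in [0, 1]$. Then

$$F_X(t) = \frac{\mathrm{vol}_2(K(t))}{\mathrm{vol}_2(K(1))} = \frac{t^2 \pi}{\pi} = t^2,$$

where $K(t)$ denotes the disk of radius $t$ centered at the origin. Differentiating $F_X$ with respect to $t$ gives the density

$$p(t) = \begin{cases} 2t & : 0 \le t \le 1\\ 0 & : \text{otherwise.} \end{cases}$$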


Example 3.2.25. Let $P$ be the uniform distribution on $[0,1]$ and define the random variable $X$ by $X(s) := \min\{s, 1 - s\}$, $s \in \mathbb{R}$. Find the probability distribution of $X$.
Answer: It is not difficult to see that

$$P\{X \le t\} = 0 \ \text{ if } t < 0 \quad \text{and} \quad P\{X \le t\} = 1 \ \text{ if } t \ge 1/2.$$

Thus it remains to evaluate $F_X(t)$ for $0 \le t < 1/2$. Here we obtain

$$\begin{aligned}
F_X(t) = P\{X \le t\} &= P\{s \in [0,1] : 0 \le s \le t \ \text{or}\ 1 - t \le s \le 1\}\\
&= P\{s \in [0,1] : 0 \le s \le t\} + P\{s \in [0,1] : 1 - t \le s \le 1\} = 2t.
\end{aligned}$$

Differentiating gives $F_X'(t) = 2$ if $0 < t < 1/2$ and $F_X'(t) = 0$ otherwise. Hence $P_X$ is the uniform distribution on $[0, 1/2]$.


Summary: To determine the density of a continuous random variable, proceed as follows:

(1) Determine $F_X(t) = P\{X \le t\}$.

(2) Differentiate $F_X$. Then the derivative $p(t) = F_X'(t)$ is the desired density.
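
The two-step recipe can be mimicked numerically. The sketch below, in Python and not part of the original text, takes the distribution function of Example 3.2.25 and approximates its derivative by a central difference; the step size `h` and the function names are assumptions chosen for illustration.

```python
def F(t):
    """Distribution function of X(s) = min{s, 1 - s} under the uniform distribution on [0, 1]."""
    if t < 0:
        return 0.0
    if t >= 0.5:
        return 1.0
    return 2.0 * t


def density(t, h=1e-6):
    """Central-difference approximation of F'(t); valid away from the kinks t = 0 and t = 1/2."""
    return (F(t + h) - F(t - h)) / (2.0 * h)


print(density(0.25))  # close to 2, the density of the uniform distribution on [0, 1/2]
print(density(0.75))  # 0.0 outside [0, 1/2]
```

At the two kinks the derivative does not exist; by Proposition 3.2.22 the value of the density there may be chosen arbitrarily.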
3.3 Special Distributed Random Variables


We agree upon the following notation: a random variable $X$ is said to be ABC-distributed (or distributed according to ABC) if its probability distribution is a probability measure of type ABC. For example, a random variable is $B_{n,p}$-distributed (or distributed according to $B_{n,p}$) if $P_X = B_{n,p}$, that is, if

$$P\{X = k\} = \binom{n}{k} p^k (1-p)^{n-k}, \quad k = 0, \ldots, n.$$

In this way we define the following random variables of special type: $X$ is

1. uniformly distributed on $\{x_1, \ldots, x_N\}$ if

$$P\{X = x_1\} = \cdots = P\{X = x_N\} = \frac{1}{N},$$

2. Poisson distributed or $\mathrm{Pois}_\lambda$-distributed if

$$P\{X = k\} = \frac{\lambda^k}{k!}\, e^{-\lambda}, \quad k = 0, 1, \ldots,$$

3. hypergeometric distributed if

$$P\{X = m\} = \frac{\binom{M}{m} \binom{N-M}{n-m}}{\binom{N}{n}}, \quad m = 0, \ldots, n,$$

4. $G_p$-distributed or geometric distributed if

$$P\{X = k\} = p\,(1-p)^{k-1}, \quad k = 1, 2, \ldots,$$

5. $B^-_{n,p}$-distributed or negative binomial distributed if

$$P\{X = k\} = \binom{k-1}{k-n} p^n (1-p)^{k-n} = \binom{k-1}{n-1} p^n (1-p)^{k-n}, \quad k = n, n+1, \ldots.$$


Remark 3.3.1. In view of Convention 3.1 we may suppose that all random variables of the preceding type are discrete. More precisely, we even may assume that $X$ has values in the (at most countably infinite) set $D$ with $P_X(D) = 1$. For example, if $X$ is $B_{n,p}$-distributed we may suppose that $X$ has values in $\{0, \ldots, n\}$.
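
The probability mass functions listed above translate directly into code. The following Python sketch is an illustration, not part of the original; all function names and the parameter values used in the checks are assumptions. It uses only `math.comb`, `math.factorial`, and `math.exp` from the standard library.

```python
from math import comb, exp, factorial


def binomial_pmf(n, p, k):
    """P{X = k} for a B_{n,p}-distributed X, k = 0, ..., n."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def poisson_pmf(lam, k):
    """P{X = k} for a Poisson distributed X with parameter lam, k = 0, 1, ..."""
    return lam**k / factorial(k) * exp(-lam)


def hypergeometric_pmf(N, M, n, m):
    """P{X = m} for a hypergeometric X with parameters N, M, n, m = 0, ..., n."""
    return comb(M, m) * comb(N - M, n - m) / comb(N, n)


def geometric_pmf(p, k):
    """P{X = k} for a geometric X, k = 1, 2, ..."""
    return p * (1 - p) ** (k - 1)


def negative_binomial_pmf(n, p, k):
    """P{X = k} for a negative binomial X, k = n, n + 1, ..."""
    return comb(k - 1, n - 1) * p**n * (1 - p) ** (k - n)


# Each family sums to 1 over its set of values (infinite sums are truncated).
print(sum(binomial_pmf(10, 0.3, k) for k in range(11)))
print(sum(hypergeometric_pmf(10, 4, 3, m) for m in range(4)))
print(sum(negative_binomial_pmf(3, 0.4, k) for k in range(3, 400)))
```

With $n = 1$ the negative binomial formula reduces to the geometric one, which gives a quick consistency check between items 4 and 5.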


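In a quite similar way we denote special distributed continuous random variables. A real-valued random variable $X$ is said to be

1. uniformly distributed on $[\alpha, \beta]$ if $P_X$ is the uniform distribution on $[\alpha, \beta]$. That is, if $[a, b] \subseteq [\alpha, \beta]$, then

$$P\{a \le X \le b\} = \frac{b - a}{\beta - \alpha},$$

2. normally distributed or $\mathcal{N}(\mu, \sigma^2)$-distributed if

$$P\{a \le X \le b\} = \frac{1}{\sqrt{2\pi}\,\sigma} \int_{a}^{b} e^{-(x-\mu)^2 / 2\sigma^2}\,dx,$$

3. standard normally distributed if it is $\mathcal{N}(0, 1)$-distributed, that is,

$$P\{a \le X \le b\} = \frac{1}{\sqrt{2\pi}} \int_{a}^{b} e^{-x^2/2}\,dx,$$

4. gamma distributed or $\Gamma_{\alpha,\beta}$-distributed if for $0 \le a < b < \infty$

$$P\{a \le X \le b\} = \frac{1}{\alpha^\beta\, \Gamma(\beta)} \int_{a}^{b} x^{\beta - 1} e^{-x/\alpha}\,dx,$$

5. $E_{\lambda,n}$-distributed or Erlang distributed if it is $\Gamma_{\lambda^{-1},n}$-distributed, that is, for $0 \le a < b < \infty$

$$P\{a \le X \le b\} = \frac{\lambda^n}{(n-1)!} \int_{a}^{b} x^{n-1} e^{-\lambda x}\,dx,$$

6. $E_\lambda$-distributed or exponentially distributed if for $0 \le a < b < \infty$

$$P\{a \le X \le b\} = \lambda \int_{a}^{b} e^{-\lambda x}\,dx = e^{-\lambda a} - e^{-\lambda b},$$

7. Cauchy distributed if

$$P\{a \le X \le b\} = \frac{1}{\pi} \int_{a}^{b} \frac{dx}{1 + x^2} = \frac{1}{\pi} \left[\arctan b - \arctan a\right].$$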
Remark 3.3.2. If a random variable $X$ possesses a special distribution, then all properties of $P_X$ carry over to $X$. For example, in this language we may now formulate Poisson’s limit theorem (Proposition 1.4.22) as follows.
Let $X_n$ be $B_{n,p_n}$-distributed and suppose that $n p_n \to \lambda > 0$ as $n \to \infty$. Then

$$\lim_{n \to \infty} P\{X_n = k\} = P\{X = k\}, \quad k = 0, 1, \ldots,$$

where $X$ is $\mathrm{Pois}_\lambda$-distributed.
Or if $X$ is gamma distributed, then $P\{X > 0\} = P_X((0, \infty)) = 1$, and so on.
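
Poisson’s limit theorem in this formulation can be illustrated numerically. The Python sketch below is not part of the original; the choices $\lambda = 2$, $k = 3$, $p_n = \lambda / n$, and the list of values of $n$ are assumptions made for the illustration.

```python
from math import comb, exp, factorial


def binomial_pmf(n, p, k):
    """P{X_n = k} for a B_{n,p}-distributed X_n."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def poisson_pmf(lam, k):
    """P{X = k} for a Poisson distributed X with parameter lam."""
    return lam**k / factorial(k) * exp(-lam)


lam, k = 2.0, 3
# With p_n = lam / n we have n * p_n = lam for every n, so the theorem applies.
errors = [abs(binomial_pmf(n, lam / n, k) - poisson_pmf(lam, k)) for n in (10, 100, 1000, 10000)]
print(errors)  # the differences shrink as n grows
```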
Remark 3.3.3. A common question is how one gets a random variable $X$ possessing a certain given distribution. For example, how do we construct a binomial or a normally distributed random variable? Suppose we want to model the rolling of a die by a random variable $X$, which is uniformly distributed on $\{1, \ldots, 6\}$. The easiest solution is to take $\Omega = \{1, \ldots, 6\}$ endowed with the uniform distribution $P$ and define $X$ by $X(\omega) = \omega$. But this is not the only way to get such a random variable. One may also roll the die $n$ times and choose $X$ as the value of the first (or of the second, etc.) roll. In a similar way random variables with other probability distributions may be constructed. Further possibilities to model random variables will be investigated in Section 4.4.


Summary: There are two ways to model a random experiment. The classical approach is to construct a probability space that describes this experiment. For example, if we toss a fair coin $n$ times and record the number of “heads,” then this may be described by the sample space $\{0, \ldots, n\}$ endowed with the probability measure $B_{n,1/2}$. Another way to model a certain random experiment is to choose a random variable $X$ so that the probability of the occurrence of an event $B \subseteq \mathbb{R}$ equals $P\{X \in B\}$. For example, the above experiment of tossing a coin may also be described by a binomial distributed random variable $X$ (with parameters $n$ and $1/2$). The great advantage of the second approach is that random variables allow algebraic operations. For example, they can be added, multiplied, or linearly combined. We will use this advantage extensively in the following sections.


3.4 Random Vectors 


Suppose we are given $n$ random variables $X_1, \ldots, X_n$ defined on a sample space $\Omega$. Our objective is to combine these $n$ variables into a single variable. More precisely, we will investigate the following type of vector-valued mappings.


Definition 3.4.1. Let $\vec{X}$ be a mapping from $\Omega$ to $\mathbb{R}^n$ represented as

$$\vec{X}(\omega) = (X_1(\omega), \ldots, X_n(\omega)), \quad \omega \in \Omega.$$

Then, $\vec{X}$ is said to be an ($n$-dimensional) random vector or vector-valued random variable, provided that each of the $X_j$s is a (real-valued) random variable. The random variables $X_j$, $1 \le j \le n$, are called coordinate mappings of $\vec{X}$.
Instead of $\vec{X}$, we may also write $(X_1, \ldots, X_n)$, that is,

$$(X_1, \ldots, X_n)(\omega) = (X_1(\omega), \ldots, X_n(\omega)), \quad \omega \in \Omega.$$
A random vector $\vec{X}$ maps $\Omega$ into $\mathbb{R}^n$, that is, we assign to each observed $\omega \in \Omega$ a vector $\vec{X}(\omega)$. The mapping $\vec{X}$ is again fixed and nonrandom. The randomness of $\vec{X}(\omega)$ is caused by the input.


Example 3.4.2. Roll a die two times. Let $X_1$ be the maximum value, $X_2$ the minimum, and $X_3$ the sum of both rolls. The three-dimensional vector $\vec{X} = (X_1, X_2, X_3)$ maps $\Omega = \{1, \ldots, 6\}^2$ into $\mathbb{R}^3$. For example, the pair $(2, 5)$ is mapped to $(5, 2, 7)$ or $(5, 6)$ to $(6, 5, 11)$.


Example 3.4.3. Suppose there are $N$ people in an auditorium. Enumerate them from 1 to $N$ and choose one person according to the uniform distribution on $\{1, \ldots, N\}$. Say we have chosen person $k$. Let $X_1(k)$ be the height of this person and $X_2(k)$ his or her weight. As a result, we get a random two-dimensional vector $(X_1(k), X_2(k))$.


Example 3.4.4. We place $n$ balls into $m$ urns successively. Hereby, each urn is equally likely. If $X_j$ denotes the number of balls in urn $j$, then we get an $m$-dimensional vector $\vec{X} = (X_1, \ldots, X_m)$. Observe that the values of $\vec{X}$ lie in the set $D = \{(k_1, \ldots, k_m) : k_1 + \cdots + k_m = n\} \subset \mathbb{N}_0^m$.


Remark 3.4.5. The preceding examples suggest that the values of the coordinate mappings depend on each other. For instance, in Example 3.4.3 larger values of $X_1$ also make larger values of $X_2$ more likely and vice versa. A basic aim of the following sections is to confirm this guess, that is, we want to find a mathematical formulation that describes whether or not two or more random variables are dependent or independent.


3.5 Joint and Marginal Distributions 


The values of the vector $\vec{X}$ are randomly distributed in $\mathbb{R}^n$. Consequently, as in the case of random variables, events of the form $\{\vec{X} \in B\}$ occur with certain probabilities. But, in contrast to the case of random variables, the event $B$ is now a subset of $\mathbb{R}^n$, not of $\mathbb{R}$ as before. More precisely, for events $B \subseteq \mathbb{R}^n$ we are interested in the following quantity⁴:

$$P\{\omega \in \Omega : \vec{X}(\omega) \in B\} = P\{\omega \in \Omega : (X_1(\omega), \ldots, X_n(\omega)) \in B\}. \tag{3.8}$$

The next definition gives the exact formulation of the problem.


4 For random vectors $\vec{X}$ and $B \in \mathcal{B}(\mathbb{R}^n)$ it follows that $\vec{X}^{-1}(B) \in \mathcal{A}$. This can be proved by similar methods as we used in the proof of Proposition 3.1.6. Thus, if $B \in \mathcal{B}(\mathbb{R}^n)$, then eq. (3.8) and also eq. (3.9) are well-defined.
Definition 3.5.1. Let $\vec{X} : \Omega \to \mathbb{R}^n$ be a random vector with coordinate mappings $X_1, \ldots, X_n$. For each Borel set $B \in \mathcal{B}(\mathbb{R}^n)$ we set

$$P_{\vec{X}}(B) = P_{(X_1, \ldots, X_n)}(B) := P\{\vec{X} \in B\}. \tag{3.9}$$

The mapping $P_{\vec{X}}$ from $\mathcal{B}(\mathbb{R}^n)$ into $[0, 1]$ is said to be the probability distribution, or, in short, the distribution of $\vec{X}$. Often, $P_{\vec{X}} = P_{(X_1, \ldots, X_n)}$ will also be called the joint distribution of $X_1, \ldots, X_n$.
In eq. (3.9) we used the shorter expression

$$P\{\vec{X} \in B\} = P\{\omega \in \Omega : \vec{X}(\omega) \in B\}.$$


As for random variables the following is also valid in the case of random vectors.

Proposition 3.5.2. The mapping $P_{\vec{X}}$ is a probability measure defined on $\mathcal{B}(\mathbb{R}^n)$.


Proof: The proof is completely analogous to that of Proposition 3.2.2. Therefore, we decide not to present it here. ∎


Let us evaluate $P_{\vec{X}}(B)$ for special Borel sets $B \subseteq \mathbb{R}^n$. If $Q$ is a box in $\mathbb{R}^n$ as in eq. (1.65), that is, for certain real numbers $a_j \le b_j$ we have

$$Q = [a_1, b_1] \times \cdots \times [a_n, b_n],$$

then it follows that

$$P_{\vec{X}}(Q) = P\{\vec{X} \in Q\} = P\{\omega \in \Omega : a_1 \le X_1(\omega) \le b_1, \ldots, a_n \le X_n(\omega) \le b_n\}.$$

The last expression may also be written as

$$P\{a_1 \le X_1 \le b_1, \ldots, a_n \le X_n \le b_n\}.$$

Hence, for each box $Q = [a_1, b_1] \times \cdots \times [a_n, b_n]$, we obtain

$$P_{\vec{X}}(Q) = P\{a_1 \le X_1 \le b_1, \ldots, a_n \le X_n \le b_n\}.$$

Thus the quantity $P_{\vec{X}}(Q)$ is the probability of the occurrence of the following event: $X_1$ attains a value in $[a_1, b_1]$, and at the same time $X_2$ attains a value in $[a_2, b_2]$, and so on up to $X_n$ attaining a value in $[a_n, b_n]$.


Example 3.5.3. Roll a fair die three times. Let $X_1$, $X_2$, and $X_3$ be the observed values in the first, second, and third roll. If $Q = [1, 2] \times [0, 1] \times [3, 4]$, then it follows that

$$P_{\vec{X}}(Q) = P\{X_1 \in \{1, 2\},\ X_2 = 1,\ X_3 \in \{3, 4\}\} = \frac{1}{54}.$$
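
The value $1/54$ can be confirmed by counting outcomes: $2 \cdot 1 \cdot 2 = 4$ of the $6^3 = 216$ triples lie in the box $Q$. A minimal Python sketch (illustrative, with `box` and `hits` as assumed names):

```python
from fractions import Fraction
from itertools import product

# All 216 equally likely outcomes of three rolls of a fair die.
omega = list(product(range(1, 7), repeat=3))

# The box Q = [1, 2] x [0, 1] x [3, 4], stored as one interval [a_j, b_j] per coordinate.
box = [(1, 2), (0, 1), (3, 4)]

# Count the outcomes whose every coordinate lies in the corresponding interval.
hits = sum(1 for w in omega if all(a <= x <= b for x, (a, b) in zip(w, box)))
prob = Fraction(hits, len(omega))
print(hits, prob)
```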
Remark 3.5.4. The previous considerations can easily be generalized to sets $B \subseteq \mathbb{R}^n$ of the form $B = B_1 \times \cdots \times B_n$ with $B_j \in \mathcal{B}(\mathbb{R})$. Then

$$P_{\vec{X}}(B) = P\{X_1 \in B_1, \ldots, X_n \in B_n\}. \tag{3.10}$$


Next we introduce the notion of marginal distributions of a random vector. 


Definition 3.5.5. Let $\vec{X} = (X_1, \ldots, X_n)$ be a random vector. The $n$ probability measures $P_{X_1}$ to $P_{X_n}$ are called the marginal distributions of $\vec{X}$.


Observe that each marginal distribution $P_{X_j}$ is a probability measure on $\mathcal{B}(\mathbb{R})$, while the joint distribution $P_{(X_1, \ldots, X_n)}$ is a probability measure defined on $\mathcal{B}(\mathbb{R}^n)$.

In this context, the following important question arises: does the joint distribution 
determine the marginal distributions and/or can the joint distribution be derived from 
the marginal ones? 

The next proposition gives the first answer. 


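Proposition 3.5.6. Let $\vec{X} = (X_1, \ldots, X_n)$ be a random vector. If $1 \le j \le n$ and $B \in \mathcal{B}(\mathbb{R})$, then it follows that

$$P_{X_j}(B) = P_{(X_1, \ldots, X_n)}(\mathbb{R} \times \cdots \times \underbrace{B}_{j} \times \cdots \times \mathbb{R}).$$

In particular, the joint distribution determines the marginal ones.


Proof: The proof is a direct consequence of formula (3.10). Let us apply it to $B_i = \mathbb{R}$ if $i \neq j$ and to $B_j = B$. Then, as asserted,

$$P_{(X_1, \ldots, X_n)}(\mathbb{R} \times \cdots \times \underbrace{B}_{j} \times \cdots \times \mathbb{R}) = P\{X_1 \in \mathbb{R}, \ldots, X_j \in B, \ldots, X_n \in \mathbb{R}\} = P\{X_j \in B\} = P_{X_j}(B). \qquad ∎$$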


The question whether or not the marginal distributions determine the joint distribution is postponed for a moment. It will be investigated in Example 3.5.8 and, more thoroughly, in Section 3.6. Before investigating this problem, let us derive some concrete formulas to evaluate the marginal distributions. Here we consider the two cases of discrete and continuous random variables separately.


3.5.1 Marginal Distributions: Discrete Case 


To make the results in this subsection easier to understand, we only consider the case 
of two-dimensional vectors. That is, we investigate two random variables and show 
how their distributions may be derived from their joint one. We indicate later on how 
this approach extends to more than two random variables. 

In order to avoid confusing notations with many indices, given a two-dimensional random vector, we denote its coordinate mappings by $X$ and $Y$ and not by $X_1$ and $X_2$. This should not lead to mix-ups. Thus, we investigate the random vector $(X, Y)$ with joint distribution $P_{(X,Y)}$ and marginal distributions $P_X$ and $P_Y$. This vector acts as

$$(X, Y)(\omega) = (X(\omega), Y(\omega)), \quad \omega \in \Omega.$$

Suppose now that $X$ and $Y$ are discrete. Then, there are finite or countably infinite sets $D = \{x_1, x_2, \ldots\}$ and $E = \{y_1, y_2, \ldots\}$ such that $X : \Omega \to D$ as well as $Y : \Omega \to E$. Consequently, the vector $(X, Y)$ maps $\Omega$ into the (at most countably infinite) set $D \times E \subset \mathbb{R}^2$. Observe that

$$D \times E = \{(x_i, y_j) : i, j = 1, 2, \ldots\},$$

hence $P_{(X,Y)}$ is discrete as well and uniquely described by the numbers

$$p_{ij} := P_{(X,Y)}(\{(x_i, y_j)\}) = P\{X = x_i, Y = y_j\}, \quad i, j = 1, 2, \ldots. \tag{3.11}$$

More precisely, given $B \subseteq \mathbb{R}^2$, then

$$P_{(X,Y)}(B) = P\{(X, Y) \in B\} = \sum_{\{(i,j)\,:\,(x_i, y_j) \in B\}} p_{ij}.$$

We turn now to the description of the marginal distributions $P_X$ and $P_Y$. These are uniquely determined by the numbers

$$q_i := P_X(\{x_i\}) = P\{X = x_i\} \quad \text{and} \quad r_j := P_Y(\{y_j\}) = P\{Y = y_j\}. \tag{3.12}$$

In other words, if $B, C \subseteq \mathbb{R}$, then it follows that

$$P_X(B) = P\{X \in B\} = \sum_{\{i\,:\,x_i \in B\}} q_i \quad \text{and} \quad P_Y(C) = P\{Y \in C\} = \sum_{\{j\,:\,y_j \in C\}} r_j.$$
The next proposition is nothing else than a reformulation of Proposition 3.5.6 in the case of discrete random variables.


Proposition 3.5.7. Let the probabilities $p_{ij}$, $q_i$, and $r_j$ be defined by eqs. (3.11) and (3.12), respectively. Then the $q_i$s and $r_j$s may be evaluated by the following equations:

$$q_i = \sum_{j=1}^{\infty} p_{ij} \quad \text{for } i = 1, 2, \ldots \quad \text{and} \quad r_j = \sum_{i=1}^{\infty} p_{ij} \quad \text{for } j = 1, 2, \ldots.$$


Proof: As already mentioned, Proposition 3.5.7 is a direct consequence of Proposition 3.5.6. But for better understanding we prefer to give a direct proof.
By virtue of the $\sigma$-additivity of $P$ it follows that

$$\begin{aligned}
q_i = P\{X = x_i\} = P\{X = x_i, Y \in E\} &= P\Big\{X = x_i,\ Y \in \bigcup_{j=1}^{\infty} \{y_j\}\Big\}\\
&= P\Big(\bigcup_{j=1}^{\infty} \{X = x_i, Y = y_j\}\Big) = \sum_{j=1}^{\infty} P\{X = x_i, Y = y_j\} = \sum_{j=1}^{\infty} p_{ij}.
\end{aligned}$$

This proves the first part. The proof for the $r_j$s follows exactly along the same line. Here, one uses

$$r_j = P\{Y = y_j\} = P\{X \in D, Y = y_j\} = \sum_{i=1}^{\infty} P\{X = x_i, Y = y_j\} = \sum_{i=1}^{\infty} p_{ij}.$$

This completes the proof. ∎


The equations in Proposition 3.5.7 may be represented in table form as follows:

$$\begin{array}{c|cccc|c}
Y \backslash X & x_1 & x_2 & x_3 & \cdots & \\ \hline
y_1 & p_{11} & p_{21} & p_{31} & \cdots & r_1\\
y_2 & p_{12} & p_{22} & p_{32} & \cdots & r_2\\
y_3 & p_{13} & p_{23} & p_{33} & \cdots & r_3\\
\vdots & \vdots & \vdots & \vdots & & \vdots\\ \hline
 & q_1 & q_2 & q_3 & \cdots & 1
\end{array}$$

The entries in the above matrix are the corresponding probabilities. For example, the entry $p_{32}$ is put into the column marked by $x_3$ and into the row where one finds $y_2$ at the left-hand side. This tells us $p_{32}$ is the probability that $X$ attains the value $x_3$ and, at the same time, $Y$ equals $y_2$. At the right and at the lower margins,⁵ one finds the corresponding sums of the rows and of the columns, respectively. These numbers describe the marginal distributions (that of $X$ at the bottom and that of $Y$ at the right margin). Finally, the number “1” at the right lower corner says that both the right column and bottom row have to add up to “1.”


Example 3.5.8. There are four balls in an urn, two labeled with “0” and another two labeled with “1.” Choose two balls without replacing the first one. Let $X$ be the value of the first ball and $Y$ that of the second. Direct calculations (use the law of multiplication) lead to

$$P\{X = 0, Y = 0\} = \frac{1}{6}, \quad P\{X = 0, Y = 1\} = \frac{1}{3}, \quad P\{X = 1, Y = 0\} = \frac{1}{3}, \quad P\{X = 1, Y = 1\} = \frac{1}{6}.$$

In tabular form this result reads as follows:

$$\begin{array}{c|cc|c}
Y \backslash X & 0 & 1 & \\ \hline
0 & \frac{1}{6} & \frac{1}{3} & \frac{1}{2}\\
1 & \frac{1}{3} & \frac{1}{6} & \frac{1}{2}\\ \hline
 & \frac{1}{2} & \frac{1}{2} & 1
\end{array}$$

Now suppose that we replace the first ball. This time we denote the values of the first and second ball by $X'$ and $Y'$, respectively. The corresponding table may now be written as follows:

$$\begin{array}{c|cc|c}
Y' \backslash X' & 0 & 1 & \\ \hline
0 & \frac{1}{4} & \frac{1}{4} & \frac{1}{2}\\
1 & \frac{1}{4} & \frac{1}{4} & \frac{1}{2}\\ \hline
 & \frac{1}{2} & \frac{1}{2} & 1
\end{array}$$

Let us look at Example 3.5.8 more thoroughly. In both cases (nonreplacing and replacing) the marginal distributions coincide, that is, $P_X = P_{X'}$ and $P_Y = P_{Y'}$. But, on the other hand, the joint distributions are different, that is, we have $P_{(X,Y)} \neq P_{(X',Y')}$.


Conclusion: The marginal distributions do not, in general, determine the joint distribution. Recall that Proposition 3.5.6 asserts the converse implication: The marginal distributions can be derived from the joint distribution.
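
The row and column sums of Proposition 3.5.7, applied to the two tables of Example 3.5.8, can be computed as follows. This is a minimal Python sketch and not part of the original text; the function `marginals` and the dictionary layout (keys are pairs $(x, y)$) are assumptions.

```python
from fractions import Fraction as Fr


def marginals(joint):
    """Given joint[(x, y)] = P{X = x, Y = y}, return the pmfs of X and of Y."""
    q, r = {}, {}
    for (x, y), p in joint.items():
        q[x] = q.get(x, 0) + p   # sum over y for fixed x
        r[y] = r.get(y, 0) + p   # sum over x for fixed y
    return q, r


# Drawing two balls without replacement (X, Y) and with replacement (X', Y').
without_replacement = {(0, 0): Fr(1, 6), (0, 1): Fr(1, 3), (1, 0): Fr(1, 3), (1, 1): Fr(1, 6)}
with_replacement = {(x, y): Fr(1, 4) for x in (0, 1) for y in (0, 1)}

print(marginals(without_replacement))
print(marginals(with_replacement))
print(without_replacement == with_replacement)  # the joint distributions differ
```

Both calls return the same pair of marginal distributions although the two joint tables are different, which is exactly the point of the conclusion above.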


5 This explains the name “marginal” for the distribution of the coordinate mappings. 
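Example 3.5.9. Roll a fair die twice. Let $X$ be the minimum value of both rolls and $Y$ the maximum. Then, if $k, l = 1, \ldots, 6$, it is easy to see that

$$P\{X = k, Y = l\} = \begin{cases} 0 & : k > l\\ \frac{1}{36} & : k = l\\ \frac{1}{18} & : k < l. \end{cases}$$

Hence, the joint distribution in table form looks as follows:

$$\begin{array}{c|cccccc|c}
Y \backslash X & 1 & 2 & 3 & 4 & 5 & 6 & \\ \hline
1 & \frac{1}{36} & 0 & 0 & 0 & 0 & 0 & \frac{1}{36}\\
2 & \frac{1}{18} & \frac{1}{36} & 0 & 0 & 0 & 0 & \frac{3}{36}\\
3 & \frac{1}{18} & \frac{1}{18} & \frac{1}{36} & 0 & 0 & 0 & \frac{5}{36}\\
4 & \frac{1}{18} & \frac{1}{18} & \frac{1}{18} & \frac{1}{36} & 0 & 0 & \frac{7}{36}\\
5 & \frac{1}{18} & \frac{1}{18} & \frac{1}{18} & \frac{1}{18} & \frac{1}{36} & 0 & \frac{9}{36}\\
6 & \frac{1}{18} & \frac{1}{18} & \frac{1}{18} & \frac{1}{18} & \frac{1}{18} & \frac{1}{36} & \frac{11}{36}\\ \hline
 & \frac{11}{36} & \frac{9}{36} & \frac{7}{36} & \frac{5}{36} & \frac{3}{36} & \frac{1}{36} & 1
\end{array}$$

If, for example, $B = \{(4,5), (5,4), (6,5), (5,6)\}$, then the values in the table imply $P_{(X,Y)}(B) = 1/9$. In the same way, one gets $P\{2 \le X \le 4\} = (9 + 7 + 5)/36 = 7/12$.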
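
The table of Example 3.5.9 and the two probabilities derived from it can be reproduced by enumeration. The following Python sketch is illustrative only; `joint`, `P_joint`, and `p_X` are assumed names, and pairs are stored as $(x, y) = (\text{minimum}, \text{maximum})$.

```python
from fractions import Fraction
from itertools import product

# Joint distribution of X = minimum and Y = maximum of two rolls of a fair die.
joint = {}
for w1, w2 in product(range(1, 7), repeat=2):
    key = (min(w1, w2), max(w1, w2))
    joint[key] = joint.get(key, Fraction(0)) + Fraction(1, 36)


def P_joint(B):
    """P_{(X,Y)}(B) for a finite set B of pairs (x, y); pairs never attained contribute 0."""
    return sum((joint.get(pair, Fraction(0)) for pair in B), Fraction(0))


# Marginal distribution of X: column sums of the table.
p_X = {k: sum(p for (x, _), p in joint.items() if x == k) for k in range(1, 7)}

print(P_joint({(4, 5), (5, 4), (6, 5), (5, 6)}))
print(p_X[2] + p_X[3] + p_X[4])
```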


To finish, we briefly go into the case of more than two discrete random variables. Thus, let $X_1, \ldots, X_n$ be random variables with $X_j : \Omega \to D_j$, where the sets $D_j$ are either finite or countably infinite. The set $D$ defined by

$$D = D_1 \times \cdots \times D_n = \{(x_1, \ldots, x_n) : x_j \in D_j\}$$

is at most countably infinite and $\vec{X} : \Omega \to D$. Consequently, $P_{\vec{X}}$ is uniquely described by the probabilities

$$p_{x_1, \ldots, x_n} = P\{X_1 = x_1, \ldots, X_n = x_n\}, \quad x_j \in D_j.$$


Proposition 3.5.10. For $1 \le j \le n$ and $x \in D_j$,

$$P\{X_j = x\} = \sum_{x_1 \in D_1} \cdots \sum_{x_{j-1} \in D_{j-1}} \ \sum_{x_{j+1} \in D_{j+1}} \cdots \sum_{x_n \in D_n} p_{x_1, \ldots, x_{j-1}, x, x_{j+1}, \ldots, x_n}.$$


Proof: The proof is exactly the same as that of Proposition 3.5.7. Therefore, we omit it. ∎

Next, we want to state an important example that shows how Proposition 3.5.10 
applies. To do so we need the following definition. 
Definition 3.5.11. Let $n$ and $m$ be integers with $m \ge 2$ and let $p_1, \ldots, p_m$ be certain success probabilities satisfying $p_j \ge 0$ and $p_1 + \cdots + p_m = 1$. An $m$-dimensional random vector $\vec{X} = (X_1, \ldots, X_m)$ is called multinomial distributed with parameters $n$ and $p_1, \ldots, p_m$ if, whenever $k_1 + \cdots + k_m = n$, then

$$P\{X_1 = k_1, \ldots, X_m = k_m\} = \binom{n}{k_1, \ldots, k_m} p_1^{k_1} \cdots p_m^{k_m}.$$

Equivalently, a random vector $\vec{X}$ is multinomial distributed if and only if its probability distribution $P_{\vec{X}}$ is a multinomial distribution as introduced in Definition 1.4.12.


Remark 3.5.12. The m-dimensional random vector $\vec{X}$ in Example 3.4.4 is multinomial
distributed with parameters n and $p_j = 1/m$. That means

$$
P\{X_1 = k_1, \ldots, X_m = k_m\} = \binom{n}{k_1, \ldots, k_m} \frac{1}{m^n}, \quad k_1 + \cdots + k_m = n .
$$


Example 3.5.13. Let $\vec{X} = (X_1, \ldots, X_m)$ be a multinomial random vector with parameters
n and $p_1, \ldots, p_m$. What are the marginal distributions of $\vec{X}$?

Answer: To simplify the calculations, we only determine the probability distribution
of $X_m$. The other cases follow in the same way. First note that in the notation of
Proposition 3.5.10

$$
p_{k_1,\ldots,k_m} = \begin{cases} \binom{n}{k_1,\ldots,k_m} p_1^{k_1} \cdots p_m^{k_m} & : k_1 + \cdots + k_m = n \\ 0 & : k_1 + \cdots + k_m \ne n \end{cases}
$$

Consequently, Proposition 3.5.10 leads to

$$
\begin{aligned}
P\{X_m = k\} &= \sum_{k_1=0}^{n} \cdots \sum_{k_{m-1}=0}^{n} p_{k_1,\ldots,k_{m-1},k} \\
&= \sum_{k_1+\cdots+k_{m-1}=n-k} \frac{n!}{k_1! \cdots k_{m-1}!\, k!}\, p_1^{k_1} \cdots p_{m-1}^{k_{m-1}}\, p_m^{k} \\
&= \frac{n!}{k!\,(n-k)!}\, p_m^{k} \sum_{k_1+\cdots+k_{m-1}=n-k} \frac{(n-k)!}{k_1! \cdots k_{m-1}!}\, p_1^{k_1} \cdots p_{m-1}^{k_{m-1}} \\
&= \binom{n}{k} p_m^{k} \sum_{k_1+\cdots+k_{m-1}=n-k} \binom{n-k}{k_1,\ldots,k_{m-1}} p_1^{k_1} \cdots p_{m-1}^{k_{m-1}} \\
&= \binom{n}{k} p_m^{k}\, (p_1 + \cdots + p_{m-1})^{n-k} = \binom{n}{k} p_m^{k}\, (1 - p_m)^{n-k} .
\end{aligned}
$$

Here, in the last step, we used the multinomial theorem (Proposition A.16) with m − 1
summands, with power n − k and entries $p_1, \ldots, p_{m-1}$.
Thus $X_m$ is binomial distributed with parameters n and $p_m$. In the same way one
gets that each $X_j$ is $B_{n,p_j}$-distributed.
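
The marginal computation of Example 3.5.13 can be confirmed by brute force for small parameters. The sketch below sums the multinomial probabilities over $k_1, \ldots, k_{m-1}$ and compares the result with the binomial formula; the function names and the parameter values n = 5, p = (1/2, 1/3, 1/6) are illustrative assumptions, not taken from the text.

```python
from fractions import Fraction
from itertools import product
from math import comb, factorial

def multinomial_pmf(ks, ps):
    """P{X_1 = k_1, ..., X_m = k_m} for a multinomial vector with n = sum(ks)."""
    coeff = factorial(sum(ks))
    for k in ks:
        coeff //= factorial(k)  # every partial quotient is an integer
    prob = Fraction(coeff)
    for k, p in zip(ks, ps):
        prob *= p ** k
    return prob

def last_marginal(n, ps, k):
    """P{X_m = k}: sum the joint probabilities over k_1, ..., k_{m-1}."""
    total = Fraction(0)
    for head in product(range(n + 1), repeat=len(ps) - 1):
        if sum(head) == n - k:  # all other index combinations have probability 0
            total += multinomial_pmf(head + (k,), ps)
    return total

n = 5
ps = [Fraction(1, 2), Fraction(1, 3), Fraction(1, 6)]
for k in range(n + 1):
    binomial = comb(n, k) * ps[-1] ** k * (1 - ps[-1]) ** (n - k)
    print(k, last_marginal(n, ps, k), binomial)
```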


Remark 3.5.14. The previous result can also be seen more directly without using
Proposition 3.5.10. Assume we place n particles into m boxes, where $p_j$ is the probability of
putting a single particle into box j. Fix some $j \le m$ and let success occur if a particle is
placed into box j. Then $X_j$ equals the number of successes, hence it is $B_{n,p_j}$-distributed.
Note that failure occurs if the particle is not placed into box j, and the probability for
this is given by $1 - p_j = \sum_{i \ne j} p_i$.


3.5.2 Marginal Distributions: Continuous Case 
Let us turn now to the continuous case. Analogous to Definition 3.2.7, a random vector
is said to be continuous whenever it possesses a density.⁶ More precisely, we suppose
that a random vector shares the following property.


Definition 3.5.15. A random vector $\vec{X} = (X_1, \ldots, X_n)$ is said to be continuous if
there is a function $p : \mathbb{R}^n \to \mathbb{R}$ such that, for all numbers $a_j < b_j$, $1 \le j \le n$,

$$
P\{a_1 \le X_1 \le b_1, \ldots, a_n \le X_n \le b_n\} = \int_{a_1}^{b_1} \cdots \int_{a_n}^{b_n} p(x_1, \ldots, x_n)\, dx_n \cdots dx_1 .
$$

Equivalently, for all real numbers $t_1, \ldots, t_n$,

$$
P\{X_1 \le t_1, \ldots, X_n \le t_n\} = \int_{-\infty}^{t_1} \cdots \int_{-\infty}^{t_n} p(x_1, \ldots, x_n)\, dx_n \cdots dx_1 .
$$

The function p is called the density function of $\vec{X}$ or also the joint density of
$X_1, \ldots, X_n$.


Remark 3.5.16. Observe that a random vector $\vec{X}$ is continuous if and only if its
probability distribution $P_{\vec{X}}$ is so, that is, the joint distribution of $X_1, \ldots, X_n$ is a continuous
probability measure on $\mathcal{B}(\mathbb{R}^n)$ in the sense of Definition 1.8.5. Moreover, its density
function coincides with the density of $P_{\vec{X}}$.


6 The following is true: for continuous random variables the generated vector possesses a density. 
The proof is far above the scope of this book. Furthermore, we do not need this assertion because we 
assume X to be continuous, not the Xjs. 
In the case of continuous random variables, the marginal distributions are evaluated
by the following rule.

Proposition 3.5.17. If a random vector $\vec{X} = (X_1, \ldots, X_n)$ has density $p : \mathbb{R}^n \to \mathbb{R}$, then
for each $j \le n$ the random variable $X_j$ is continuous with density

$$
p_j(x_j) = \underbrace{\int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty}}_{n-1 \text{ integrals}} p(\ldots, x_{j-1}, x_j, x_{j+1}, \ldots)\, dx_n \cdots dx_{j+1}\, dx_{j-1} \cdots dx_1 . \tag{3.13}
$$

If n = 2, the above formula reads as

$$
p_1(x_1) = \int_{-\infty}^{\infty} p(x_1, x_2)\, dx_2 \quad \text{and} \quad p_2(x_2) = \int_{-\infty}^{\infty} p(x_1, x_2)\, dx_1 .
$$


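Proof: Fix an integer $j \le n$. An application of Proposition 3.5.6 implies

$$
\begin{aligned}
P_{X_j}([a, b]) &= P_{\vec{X}}(\mathbb{R} \times \cdots \times \underbrace{[a, b]}_{j} \times \cdots \times \mathbb{R}) \\
&= \int_{-\infty}^{\infty} \cdots \underbrace{\int_{a}^{b}}_{j} \cdots \int_{-\infty}^{\infty} p(x_1, \ldots, x_n)\, dx_n \cdots dx_1 \\
&= \int_{a}^{b} \left[ \int_{-\infty}^{\infty} \cdots \int_{-\infty}^{\infty} p(\ldots, x_{j-1}, x_j, x_{j+1}, \ldots)\, dx_n \cdots dx_{j+1}\, dx_{j-1} \cdots dx_1 \right] dx_j \\
&= \int_{a}^{b} p_j(x_j)\, dx_j
\end{aligned}
$$

with $p_j$ defined by eq. (3.13). The interchange of the integrals was justified by Fubini’s
theorem (Proposition A.5.5); note that p is a density, hence it is non-negative. Since
the preceding equation holds for all real numbers a < b, the function $p_j$ has to be a
density of $P_{X_j}$. This completes the proof. □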


Remark 3.5.18. Another way to formulate Proposition 3.5.17 is as follows: if the
function $p : \mathbb{R}^n \to \mathbb{R}$ is a joint density of $X_1, \ldots, X_n$, then $p_1, \ldots, p_n$ defined in eq. (3.13)
are densities of the random variables $X_1, \ldots, X_n$, respectively.


Example 3.5.19. Choose at random a point $x = (x_1, x_2, x_3)$ in the unit ball of $\mathbb{R}^3$. How
are the coordinates $x_1$, $x_2$, and $x_3$ distributed?
Answer: Let $\vec{X} = (X_1, X_2, X_3)$ be uniformly distributed on the unit ball
$K = \{(x_1, x_2, x_3) : x_1^2 + x_2^2 + x_3^2 \le 1\}$.
Then the joint density is given by⁷

$$
p(x) = \begin{cases} \frac{3}{4\pi} & : x \in K \\ 0 & : x \notin K \end{cases}
$$

An application of Proposition 3.5.17 leads to $p_1(x_1) = 0$ whenever $|x_1| > 1$ and, if $|x_1| \le 1$,
then it follows that

$$
p_1(x_1) = \frac{3}{4\pi} \iint_{x_2^2 + x_3^2 \le 1 - x_1^2} dx_2\, dx_3 = \frac{3}{4\pi}\, \pi (1 - x_1^2) = \frac{3}{4} (1 - x_1^2) .
$$

Hence, $X_1$ has the density

$$
p_1(s) = \begin{cases} \frac{3}{4}(1 - s^2) & : -1 \le s \le 1 \\ 0 & : \text{otherwise} \end{cases}
$$

Of course, by symmetry $X_2$ and $X_3$ possess exactly the same distribution densities.


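Example 3.5.20. Suppose the two-dimensional random vector $(X_1, X_2)$ has the density
p defined by⁸

$$
p(x_1, x_2) = \begin{cases} 8 x_1 x_2 & : 0 \le x_1 \le x_2 \le 1 \\ 0 & : \text{otherwise} \end{cases}
$$

Then, the density $p_1$ of $X_1$ is given by

$$
p_1(x_1) = \int_{-\infty}^{\infty} p(x_1, x_2)\, dx_2 = 8 x_1 \int_{x_1}^{1} x_2\, dx_2 = 4 (x_1 - x_1^3), \quad 0 \le x_1 \le 1,
$$

and $p_1(x_1) = 0$ if $x_1 \notin [0, 1]$.
In the case of $p_2$, the density of $X_2$, it follows that

$$
p_2(x_2) = \int_{-\infty}^{\infty} p(x_1, x_2)\, dx_1 = 8 x_2 \int_{0}^{x_2} x_1\, dx_1 = 4 x_2^3, \quad 0 \le x_2 \le 1,
$$

and $p_2(x_2) = 0$ if $x_2 \notin [0, 1]$.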
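
The two marginal densities of Example 3.5.20 can be cross-checked numerically by integrating the joint density over the other variable, as Proposition 3.5.17 prescribes. The following is a rough sketch using a simple midpoint rule; the helper `integrate` and its step count are assumptions made here, and the agreement is only up to the discretization error.

```python
def p(x1, x2):
    """Joint density of Example 3.5.20."""
    return 8.0 * x1 * x2 if 0.0 <= x1 <= x2 <= 1.0 else 0.0

def integrate(f, a, b, steps=20000):
    """Midpoint rule approximation of the integral of f over [a, b]."""
    h = (b - a) / steps
    return h * sum(f(a + (i + 0.5) * h) for i in range(steps))

def p1(x1):
    """Marginal density of X_1: integrate the joint density over x2."""
    return integrate(lambda x2: p(x1, x2), 0.0, 1.0)

def p2(x2):
    """Marginal density of X_2: integrate the joint density over x1."""
    return integrate(lambda x1: p(x1, x2), 0.0, 1.0)

for x in (0.2, 0.5, 0.9):
    # Numerical marginals next to the closed forms 4(x - x^3) and 4x^3.
    print(x, p1(x), 4 * (x - x**3), p2(x), 4 * x**3)
```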


7 Recall $\mathrm{vol}_3(K) = \frac{4}{3}\pi$.
8 Check that p is indeed a probability density.
3.6 Independence of Random Variables


The central question considered in this section is as follows: when are n given random 
variables independent? Surely everybody has an intuitive idea about the independ- 
ence or dependence of random values. But how do we express this property by a 
mathematical formula? Let us try to approach a solution of this problem with an 
example. 


Example 3.6.1. Roll a fair die twice and define the two random variables $X_1$ and $X_2$ as
the result of the first and second roll, respectively. These random variables are intuitively
independent of each other. But which formula expresses this? Take two subsets
$B_1, B_2 \subseteq \{1, \ldots, 6\}$ and look at their preimages $A_1 = X_1^{-1}(B_1)$ and $A_2 = X_2^{-1}(B_2)$. Then $A_1$
occurs if the first result belongs to $B_1$, while $A_2$ occurs whenever the second
result belongs to $B_2$. For example, $A_1$ might be that the first result is an even number
while $A_2$ could occur if the second result equals “4.” The basic observation is, no matter
how $B_1$ and $B_2$ were chosen, the occurrence of their preimages $A_1$ and $A_2$ only
depends on the first or second roll, respectively. Therefore, they should be independent
(as events) in the sense of Definition 2.2.2, that is, the following equation should hold:

$$
\begin{aligned}
P\{X_1 \in B_1, X_2 \in B_2\} &= P(X_1^{-1}(B_1) \cap X_2^{-1}(B_2)) = P(A_1 \cap A_2) \\
&= P(A_1) \cdot P(A_2) = P(X_1^{-1}(B_1)) \cdot P(X_2^{-1}(B_2)) = P\{X_1 \in B_1\} \cdot P\{X_2 \in B_2\} .
\end{aligned}
$$
This observation leads us to the following definition of independence.

Definition 3.6.2. Let $X_1, \ldots, X_n$ be n random variables mapping $\Omega$ into $\mathbb{R}$. These
variables are said to be (stochastically) independent if, for all Borel sets $B_j \subseteq \mathbb{R}$,

$$
P\{X_1 \in B_1, \ldots, X_n \in B_n\} = P\{X_1 \in B_1\} \cdots P\{X_n \in B_n\} . \tag{3.14}
$$


Remark 3.6.3. By virtue of Remark 3.5.4, eq. (3.14) may also be written as

$$
P_{(X_1,\ldots,X_n)}(B_1 \times \cdots \times B_n) = P_{X_1}(B_1) \cdots P_{X_n}(B_n), \quad B_j \in \mathcal{B}(\mathbb{R}) .
$$

Before proceeding further, we briefly recall Corollary 1.9.7.

Corollary 3.6.4. Given n probability measures $P_1, \ldots, P_n$ defined on $\mathcal{B}(\mathbb{R})$, there exists
a unique probability measure P on $\mathcal{B}(\mathbb{R}^n)$, the product measure denoted by $P = P_1 \otimes \cdots \otimes P_n$,
such that for all Borel sets $B_j \subseteq \mathbb{R}$

$$
P(B_1 \times \cdots \times B_n) = P_1(B_1) \cdots P_n(B_n) . \tag{3.15}
$$
Now, we are prepared to state the characterization of independent random variables
by properties of their distributions.

Proposition 3.6.5. The random variables $X_1, \ldots, X_n$ are independent if and only if their
joint distribution coincides with the product probability of the marginal distributions.
That is, if and only if

$$
P_{(X_1,\ldots,X_n)} = P_{X_1} \otimes \cdots \otimes P_{X_n} .
$$

Proof: In view of Corollary 3.6.4, the product probability P of $P_{X_1}, \ldots, P_{X_n}$ is the
unique probability measure on $\mathcal{B}(\mathbb{R}^n)$ satisfying

$$
P(B_1 \times \cdots \times B_n) = P_{X_1}(B_1) \cdots P_{X_n}(B_n), \quad B_j \in \mathcal{B}(\mathbb{R}) .
$$

On the other hand, by Remark 3.6.3, the $X_j$s are independent if and only if

$$
P_{(X_1,\ldots,X_n)}(B_1 \times \cdots \times B_n) = P_{X_1}(B_1) \cdots P_{X_n}(B_n), \quad B_j \in \mathcal{B}(\mathbb{R}) . \tag{3.16}
$$

Consequently, eq. (3.16) holds for all Borel sets $B_j$ if and only if $P_{(X_1,\ldots,X_n)}$ is the product
probability $P_{X_1} \otimes \cdots \otimes P_{X_n}$. This completes the proof. □


Corollary 3.6.6. If $X_1, \ldots, X_n$ are independent, the joint distribution $P_{(X_1,\ldots,X_n)}$ is
uniquely determined by its marginal distributions $P_{X_1}, \ldots, P_{X_n}$.

Proof: Proposition 3.6.5 asserts $P_{(X_1,\ldots,X_n)} = P_{X_1} \otimes \cdots \otimes P_{X_n}$. Hence, the joint
distribution is uniquely described by the marginal ones. □


The next proposition clarifies the relation between the properties “independence of
events” and “independence of random variables.” At first glance the assertion looks
trivial or self-evident, but it is not at all. The reason is that the definition of independence
for more than two events, as given in Definition 2.2.12, is more complicated than
in the case of two events.


Proposition 3.6.7. The random variables $X_1, \ldots, X_n$ are independent if and only if for
all Borel sets $B_1, \ldots, B_n$ in $\mathbb{R}$ the events

$$
X_1^{-1}(B_1), \ldots, X_n^{-1}(B_n)
$$

are stochastically independent in $(\Omega, \mathcal{A}, P)$.
Proof⁹: When are $X_1^{-1}(B_1), \ldots, X_n^{-1}(B_n)$ independent? According to Definition 2.2.12
this holds if for all subsets $I \subseteq \{1, \ldots, n\}$

$$
P\Big( \bigcap_{i \in I} X_i^{-1}(B_i) \Big) = \prod_{i \in I} P(X_i^{-1}(B_i)) . \tag{3.17}
$$

On the other hand, by Definition 3.6.2, the $X_1, \ldots, X_n$ are independent if

$$
P\Big( \bigcap_{i=1}^{n} X_i^{-1}(B_i) \Big) = P\{X_1 \in B_1, \ldots, X_n \in B_n\} = \prod_{i=1}^{n} P\{X_i \in B_i\} = \prod_{i=1}^{n} P(X_i^{-1}(B_i)) . \tag{3.18}
$$

Of course, eq. (3.17) implies eq. (3.18); use eq. (3.17) with $I = \{1, \ldots, n\}$. But it is far
from clear why, conversely, eq. (3.18) should imply eq. (3.17). As we saw in Example
2.2.10, for fixed sets $B_j$ this is even false. The key observation is that eq. (3.17) has to be
valid for all Borel sets $B_j$. This allows us to choose the Borel sets in an appropriate way.

Thus let us assume the validity of eq. (3.18) for all Borel sets in $\mathbb{R}$. Given $B_j \in \mathcal{B}(\mathbb{R})$
and a subset I of $\{1, \ldots, n\}$ we introduce “new” sets $B_1', \ldots, B_n'$ as follows: $B_i' = B_i$ if $i \in I$
and $B_i' = \mathbb{R}$ if $i \notin I$. This choice of the $B_i'$ implies $X_i^{-1}(B_i') = \Omega$ whenever $i \notin I$. An
application of eq. (3.18) to $B_1', \ldots, B_n'$ leads to (recall $X_i^{-1}(B_i') = \Omega$ if $i \notin I$)

$$
P\Big( \bigcap_{i \in I} X_i^{-1}(B_i) \Big) = P\Big( \bigcap_{i=1}^{n} X_i^{-1}(B_i') \Big) = \prod_{i=1}^{n} P(X_i^{-1}(B_i')) = \prod_{i \in I} P(X_i^{-1}(B_i)) .
$$

This proves eq. (3.17) for any subset I of $\{1, \ldots, n\}$. Hence, $X_1^{-1}(B_1), \ldots, X_n^{-1}(B_n)$ are
independent as asserted. □


Remark 3.6.8. To verify the independence of $X_1, \ldots, X_n$, it is not necessary to check
eq. (3.14) for all Borel sets $B_j$. It suffices if this is valid for real intervals $[a_j, b_j]$. In other
words, $X_1, \ldots, X_n$ are independent if and only if, for all $a_j < b_j$,

$$
P\{a_1 \le X_1 \le b_1, \ldots, a_n \le X_n \le b_n\} = P\{a_1 \le X_1 \le b_1\} \cdots P\{a_n \le X_n \le b_n\} . \tag{3.19}
$$

Furthermore, it also suffices to choose the Borel sets as intervals $(-\infty, t_j]$ for $t_j \in \mathbb{R}$, i.e.,
$X_1, \ldots, X_n$ are independent if and only if, for all $t_j \in \mathbb{R}$,

$$
P\{X_1 \le t_1, \ldots, X_n \le t_n\} = P\{X_1 \le t_1\} \cdots P\{X_n \le t_n\} .
$$


9 The proposition and its proof are not necessarily needed for further reading. But they may be helpful 
for a better understanding of the independence of events and of random variables. 
3.6.1 Independence of Discrete Random Variables


As in Section 3.5.1, we restrict ourselves to the case of two random variables. The
extension to more than two variables is straightforward and will be briefly considered
at the end of this section. We use the same notation as in Section 3.5.1. That is, the
two random variables are denoted by X and Y, and they map $\Omega$ into $D = \{x_1, x_2, \ldots\}$
and $E = \{y_1, y_2, \ldots\}$, respectively. The joint distribution of (X, Y) as well as the marginal
distributions, that is, the distributions of X and Y, are described as in eqs. (3.11) and
(3.12) by

$$
p_{ij} = P\{X = x_i, Y = y_j\}, \quad q_i = P\{X = x_i\} \quad \text{and} \quad r_j = P\{Y = y_j\} .
$$

With these notations the following result is valid.

Proposition 3.6.9. For the independence of two random variables X and Y, it is
necessary and sufficient that

$$
p_{ij} = q_i \cdot r_j, \quad 1 \le i, j < \infty .
$$


Proof: The assertion is an immediate consequence of Propositions 1.9.9 and 3.6.5. But,
because of the importance of the result, we give an alternative proof avoiding the direct
use of product probabilities; only the techniques are similar.

Let us first show that the condition is necessary. Therefore, choose indices i and
j, and put $B_1 := \{x_i\}$ and $B_2 := \{y_j\}$. Then $\{X \in B_1\}$ occurs if and only if $X = x_i$, and, in
the same way, the occurrence of $\{Y \in B_2\}$ is equivalent to $Y = y_j$. Since X and Y are
assumed to be independent, as claimed,

$$
\begin{aligned}
p_{ij} &= P\{X = x_i, Y = y_j\} = P\{X \in B_1, Y \in B_2\} = P\{X \in B_1\} \cdot P\{Y \in B_2\} \\
&= P\{X = x_i\} \cdot P\{Y = y_j\} = q_i \cdot r_j .
\end{aligned}
$$

To prove the converse implication, assume we have $p_{ij} = q_i \cdot r_j$ for all pairs (i, j) of
integers. Let $B_1$ and $B_2$ be two arbitrary subsets of $\mathbb{R}$. Then it follows

$$
\begin{aligned}
P\{X \in B_1, Y \in B_2\} &= P_{(X,Y)}(B_1 \times B_2) = \sum_{\{(i,j) : (x_i, y_j) \in B_1 \times B_2\}} p_{ij} \\
&= \sum_{\{(i,j) : x_i \in B_1,\ y_j \in B_2\}} q_i \cdot r_j = \sum_{\{i : x_i \in B_1\}} \sum_{\{j : y_j \in B_2\}} q_i \cdot r_j \\
&= \Big( \sum_{\{i : x_i \in B_1\}} q_i \Big) \Big( \sum_{\{j : y_j \in B_2\}} r_j \Big) = P_X(B_1) \cdot P_Y(B_2) \\
&= P\{X \in B_1\} \cdot P\{Y \in B_2\} .
\end{aligned}
$$

Since $B_1$ and $B_2$ were arbitrary, the random variables X and Y are independent. This
completes the proof. □


Remark 3.6.10. The previous proposition implies again that for (discrete) independent
random variables the joint distribution is determined by the marginal ones.
Indeed, in order to know the $p_{ij}$s, it suffices to know the $q_i$s and $r_j$s.

Let us represent the assertion of Proposition 3.6.9 graphically. It asserts that the
random variables X and Y are independent if and only if the table describing their joint
distribution may be represented as follows:

 Y\X |  x1     x2     x3    ... |
-----+--------------------------+-----
  y1 | q1r1   q2r1   q3r1   ... | r1
  y2 | q1r2   q2r2   q3r2   ... | r2
  y3 | q1r3   q2r3   q3r3   ... | r3
  .  |  .      .      .         | .
-----+--------------------------+-----
     |  q1     q2     q3    ... | 1


Example 3.6.11. Proposition 3.6.9 lets us conclude that X and Y in Example 3.5.8
(without replacement) are dependent while X′ and Y′ (with replacement) are
independent. Furthermore, by the same argument, the random variables X and Y in
Example 3.5.9 (minimum and maximum value when rolling a die twice) are dependent
as well.
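
The criterion of Proposition 3.6.9 is easy to test mechanically once a joint table is available. The sketch below, with the hypothetical helper `is_independent`, compares every entry $p_{ij}$ with the product $q_i \cdot r_j$ of the marginals; it is applied to the two rolls of a fair die and to their minimum and maximum from Example 3.5.9.

```python
from fractions import Fraction
from itertools import product

def is_independent(joint):
    """Check p_ij == q_i * r_j for a joint table given as {(x, y): probability}."""
    xs = {x for x, _ in joint}
    ys = {y for _, y in joint}
    q = {x: sum(joint.get((x, y), 0) for y in ys) for x in xs}  # marginal of X
    r = {y: sum(joint.get((x, y), 0) for x in xs) for y in ys}  # marginal of Y
    return all(joint.get((x, y), 0) == q[x] * r[y] for x in xs for y in ys)

weight = Fraction(1, 36)
rolls = {}    # joint table of (first roll, second roll)
min_max = {}  # joint table of (minimum, maximum)
for a, b in product(range(1, 7), repeat=2):
    rolls[(a, b)] = rolls.get((a, b), 0) + weight
    key = (min(a, b), max(a, b))
    min_max[key] = min_max.get(key, 0) + weight

print(is_independent(rolls), is_independent(min_max))
```

Missing table entries are treated as probability zero, which is what makes the minimum and maximum fail the test: for instance P{X = 2, Y = 1} = 0 although both marginal probabilities are positive.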


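Example 3.6.12. Let X and Y be two independent $\mathrm{Pois}_\lambda$-distributed random variables.
Then the joint distribution of the vector (X, Y) is determined by

$$
P\{X = k, Y = l\} = \frac{\lambda^{k+l}}{k!\, l!}\, e^{-2\lambda}, \quad (k, l) \in \mathbb{N}_0 \times \mathbb{N}_0 .
$$

For example, applying this for $P_{(X,Y)}(B)$ with $B = \{(k, l) : k = l\}$ leads to

$$
P\{X = Y\} = \sum_{k=0}^{\infty} P\{X = k, Y = k\} = \sum_{k=0}^{\infty} \frac{\lambda^{2k}}{(k!)^2}\, e^{-2\lambda} .
$$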
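
The series for P{X = Y} has no elementary closed form, but it converges quickly and can be evaluated numerically. A small sketch, in which the function name `prob_equal` and the truncation at 60 terms are choices made here for illustration:

```python
from math import exp, factorial

def prob_equal(lam, terms=60):
    """Truncated series for P{X = Y} with X, Y independent Poisson(lam)."""
    return exp(-2 * lam) * sum(lam ** (2 * k) / factorial(k) ** 2 for k in range(terms))

def poisson_pmf(lam, k):
    """P{X = k} for a Poisson(lam) random variable."""
    return lam ** k / factorial(k) * exp(-lam)

lam = 1.0
# Cross-check: by independence, P{X = k, Y = k} = P{X = k} * P{Y = k}.
check = sum(poisson_pmf(lam, k) ** 2 for k in range(60))
print(prob_equal(lam), check)
```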


Example 3.6.13. Suppose X and Y are two independent geometric distributed random
variables, with parameters p and q, respectively. Evaluate $P\{X \le Y\}$.
Solution: By the independence of X and Y,

$$
\begin{aligned}
P\{X \le Y\} &= \sum_{k=1}^{\infty} P\{X = k, Y \ge k\} = \sum_{k=1}^{\infty} P\{X = k\} \cdot P\{Y \ge k\} \\
&= \sum_{k=1}^{\infty} p (1-p)^{k-1} \sum_{l=k}^{\infty} q (1-q)^{l-1} = pq \sum_{k=0}^{\infty} (1-p)^{k} \sum_{l=k+1}^{\infty} (1-q)^{l-1} \\
&= pq \sum_{k=0}^{\infty} (1-p)^{k} \sum_{l=k}^{\infty} (1-q)^{l} = pq \Big( \sum_{k=0}^{\infty} (1-p)^{k} (1-q)^{k} \Big) \Big( \sum_{l=0}^{\infty} (1-q)^{l} \Big) \\
&= \frac{p}{1 - (1-p)(1-q)} = \frac{p}{p + q - pq} .
\end{aligned}
$$

Example of application: Player A rolls a die and, simultaneously, player B tosses two
fair coins labeled with “0” and “1.” Find the probability that player A observes the
number “6” for the first time strictly before player B gets “1” two times.

Answer: Let {Y = k} be the event that player A observes his first “6” in trial k.
Similarly, {X = k} occurs if player B has his first two “ones” in trial k. Then we ask
for the probability P{Y < X}. Note that X is geometrically distributed with parameter
p = 1/4, while the success probability for Y is q = 1/6. Hence, by the above calculations,

$$
P\{Y < X\} = 1 - P\{X \le Y\} = 1 - \frac{1/4}{1/4 + 1/6 - 1/24} = \frac{1}{3} .
$$
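
As a sanity check of the closed form $p/(p + q - pq)$, one can compare it with a truncated version of the defining series. The sketch below assumes, as in the calculation above, that the geometric distribution lives on {1, 2, ...} with $P\{X = k\} = p(1-p)^{k-1}$; the function names are illustrative.

```python
from fractions import Fraction

def prob_x_le_y(p, q):
    """Closed form P{X <= Y} = p / (p + q - p*q)."""
    return p / (p + q - p * q)

def prob_x_le_y_truncated(p, q, terms=2000):
    """Truncated series sum_k P{X = k} * P{Y >= k}, using P{Y >= k} = (1 - q)**(k - 1)."""
    return sum(p * (1 - p) ** (k - 1) * (1 - q) ** (k - 1) for k in range(1, terms + 1))

# Application from the text: p = 1/4 (two "ones"), q = 1/6 (a "6").
p, q = Fraction(1, 4), Fraction(1, 6)
answer = 1 - prob_x_le_y(p, q)  # P{Y < X}
print(answer, prob_x_le_y_truncated(0.25, 1 / 6))
```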


The next objective is to investigate in which cases two quite special random variables
are independent. To this end we need the following notation.

Definition 3.6.14. Let $\Omega$ be a set and $A \subseteq \Omega$. Then the indicator function
$\mathbb{1}_A : \Omega \to \mathbb{R}$ of A is defined by

$$
\mathbb{1}_A(\omega) := \begin{cases} 1 & : \omega \in A \\ 0 & : \omega \notin A \end{cases} \tag{3.20}
$$


Let us state some basic properties of indicator functions. 


Proposition 3.6.15. Let $(\Omega, \mathcal{A}, P)$ be a probability space.

(1) The indicator function of a set $A \subseteq \Omega$ is a random variable if and only if $A \in \mathcal{A}$.

(2) If $A \in \mathcal{A}$, then $\mathbb{1}_A$ is $B_{1,p}$-distributed (binomial) where p = P(A).

(3) If $A, B \in \mathcal{A}$, then the random variables $\mathbb{1}_A$ and $\mathbb{1}_B$ are independent if and only if the
events A and B are so.
Proof: Given $t \in \mathbb{R}$, the event $\{\omega \in \Omega : \mathbb{1}_A(\omega) \le t\}$ is either empty, $A^c$, or $\Omega$, depending
on whether t < 0, 0 ≤ t < 1, or t ≥ 1. Consequently, the set $\{\omega \in \Omega : \mathbb{1}_A(\omega) \le t\}$ is in $\mathcal{A}$ for all
$t \in \mathbb{R}$ if and only if $A^c \in \mathcal{A}$. But this happens if and only if $A \in \mathcal{A}$, which proves the first
assertion.

To prove the second part we first observe that $\mathbb{1}_A$ attains only the values “0” and
“1.” Since

$$
P\{\mathbb{1}_A = 1\} = P\{\omega \in \Omega : \mathbb{1}_A(\omega) = 1\} = P(A) = p,
$$

it is $B_{1,p}$-distributed with p = P(A) as claimed.
Let us turn to the last assertion. Given $A, B \in \mathcal{A}$, the joint distribution of $\mathbb{1}_A$ and $\mathbb{1}_B$ in table
form is

 1_B\1_A |      0        |      1       |
---------+---------------+--------------+--------
    0    | P(A^c ∩ B^c)  | P(A ∩ B^c)   | P(B^c)
    1    | P(A^c ∩ B)    | P(A ∩ B)     | P(B)
---------+---------------+--------------+--------
         | P(A^c)        | P(A)         |

Consequently, by Proposition 3.6.9 the random variables $\mathbb{1}_A$ and $\mathbb{1}_B$ are independent
if and only if the following equations are valid:

$$
\begin{aligned}
P(A^c \cap B^c) &= P(A^c) \cdot P(B^c), & P(A^c \cap B) &= P(A^c) \cdot P(B), \\
P(A \cap B^c) &= P(A) \cdot P(B^c), & P(A \cap B) &= P(A) \cdot P(B) .
\end{aligned}
$$

Because of Proposition 2.2.7 these four equations are satisfied if and only if the events
A and B are independent. This proves the third assertion. □


Finally, let us briefly discuss the independence of more than two discrete random
variables. Here we use the same notation as in Proposition 3.5.10, that is, the
random variables $X_1, \ldots, X_n$ satisfy $X_j : \Omega \to D_j$, where $D_j$ is either finite or countably
infinite. Then the following generalization of Proposition 3.6.9 is valid. Its proof is
almost identical to that for two variables. Therefore, we omit it.

Proposition 3.6.16. The random variables $X_1, \ldots, X_n$ are independent if and only if for
all $x_j \in D_j$

$$
P\{X_1 = x_1, \ldots, X_n = x_n\} = P\{X_1 = x_1\} \cdots P\{X_n = x_n\} .
$$

Example 3.6.17. Let us consider the problem of tossing a biased coin n times. The
sample space is $\Omega = \{0, 1\}^n$, and the describing probability measure P is as in eq. (3.5).
The random variables $X_j$ are defined as results of toss j. Then $X_j : \Omega \to D_j$, where
$D_j = \{0, 1\}$. If we choose arbitrary $x_j \in D_j$, then either $x_j = 0$ or $x_j = 1$. Let k be the
number of those $x_j$ that equal 1, that is, $k = x_1 + \cdots + x_n$. Formula (3.5) implies

$$
P\{X_1 = x_1, \ldots, X_n = x_n\} = p^k (1 - p)^{n-k} .
$$

On the other hand, as shown in Example 3.2.16, the probability distribution of each $X_j$
satisfies

$$
P\{X_j = 0\} = 1 - p \quad \text{and} \quad P\{X_j = 1\} = p .
$$

Since exactly k of the $x_j$s are “1” and n − k are “0,” this implies

$$
P\{X_1 = x_1\} \cdots P\{X_n = x_n\} = p^k (1 - p)^{n-k} .
$$

Summing up, for all $x_j \in D_j$,

$$
P\{X_1 = x_1, \ldots, X_n = x_n\} = p^k (1 - p)^{n-k} = P\{X_1 = x_1\} \cdots P\{X_n = x_n\},
$$

that is, $X_1, \ldots, X_n$ are independent.


3.6.2 Independence of Continuous Random Variables 
We will consider the question in which cases continuous random variables are
independent. Thus, let $X_1, \ldots, X_n$ be continuous random variables with distribution
densities $p_1, \ldots, p_n$, that is, for $1 \le j \le n$ and real numbers a < b

$$
P_{X_j}([a, b]) = P\{a \le X_j \le b\} = \int_a^b p_j(x)\, dx .
$$

With this notation the independence of the $X_j$s may be characterized as follows.


Proposition 3.6.18. For random variables $X_1, \ldots, X_n$ with densities $p_1, \ldots, p_n$ we
define a function $p : \mathbb{R}^n \to \mathbb{R}$ by

$$
p(x_1, \ldots, x_n) := p_1(x_1) \cdots p_n(x_n), \quad (x_1, \ldots, x_n) \in \mathbb{R}^n . \tag{3.21}
$$

Then the $X_j$s are independent if and only if p defined by eq. (3.21) is a distribution density
of the random vector $\vec{X} = (X_1, \ldots, X_n)$.

Proof: As in the discrete case, the result follows directly from Propositions 1.9.12 and
3.6.5. Without using product probabilities, we may argue as follows.
First we observe that p defined by eq. (3.21) is a distribution density of $\vec{X}$ if and
only if for all $a_j < b_j$

$$
P\{a_1 \le X_1 \le b_1, \ldots, a_n \le X_n \le b_n\} = \int_{a_1}^{b_1} \cdots \int_{a_n}^{b_n} p_1(x_1) \cdots p_n(x_n)\, dx_n \cdots dx_1 . \tag{3.22}
$$

The right-hand side of eq. (3.22) coincides with

$$
\Big( \int_{a_1}^{b_1} p_1(x_1)\, dx_1 \Big) \cdots \Big( \int_{a_n}^{b_n} p_n(x_n)\, dx_n \Big) = P\{a_1 \le X_1 \le b_1\} \cdots P\{a_n \le X_n \le b_n\} .
$$

From this we derive that eq. (3.22) is valid for all $a_j < b_j$ if and only if

$$
P\{a_1 \le X_1 \le b_1, \ldots, a_n \le X_n \le b_n\} = P\{a_1 \le X_1 \le b_1\} \cdots P\{a_n \le X_n \le b_n\} .
$$

By Remark 3.6.8 this is equivalent to the independence of the $X_j$s. This completes the
proof. □


Example 3.6.19. Throw a dart at a target, which is a circle K of radius 1. The center of the
circle is the point (0, 0) and $(x_1, x_2) \in K$ denotes the point where the dart hits the target.
We assume that the point hit is uniformly distributed on K. The question is whether
the coordinates $x_1$ and $x_2$ of the point hit are dependent or independent of each
other.

Answer: Let P be the uniform distribution on K, and define two random variables
$X_1$ and $X_2$ by $X_1(x_1, x_2) = x_1$ and $X_2(x_1, x_2) = x_2$. In this notation, the above question
is whether the random variables $X_1$ and $X_2$ are independent. The density $p_1$ of $X_1$ was
found in eq. (3.7). By symmetry, $p_2$, the density of $X_2$, coincides with $p_1$, that is, we have

$$
p_1(x_1) = \begin{cases} \frac{2}{\pi} \sqrt{1 - x_1^2} & : |x_1| \le 1 \\ 0 & : |x_1| > 1 \end{cases}
\quad \text{and} \quad
p_2(x_2) = \begin{cases} \frac{2}{\pi} \sqrt{1 - x_2^2} & : |x_2| \le 1 \\ 0 & : |x_2| > 1 . \end{cases}
$$

But $p_1(x_1) \cdot p_2(x_2)$ cannot be a distribution density of $P_{(X_1, X_2)}$. Indeed, the vector
$\vec{X} = (X_1, X_2)$ is uniformly distributed on K, thus its (correct) density is p with

$$
p(x_1, x_2) = \begin{cases} \frac{1}{\pi} & : x_1^2 + x_2^2 \le 1 \\ 0 & : \text{otherwise} \end{cases}
$$

Thus, we conclude that $X_1$ and $X_2$ are dependent, hence also the coordinates $x_1$ and $x_2$
of the point hit.
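
To make the mismatch in Example 3.6.19 concrete, one can evaluate both functions at a point of the square $[-1, 1]^2$ that lies outside the disc. The sketch below does this at (0.9, 0.9). A single point does not prove anything by itself, since densities are only determined up to sets of measure zero, but the product of the marginals is strictly positive on the whole corner region outside the disc, where the true joint density vanishes. The function names are illustrative.

```python
from math import pi, sqrt

def marginal(x):
    """Density of one coordinate of a uniform point on the unit disc."""
    return 2 / pi * sqrt(1 - x * x) if abs(x) <= 1 else 0.0

def joint(x1, x2):
    """Density of the uniform distribution on the unit disc."""
    return 1 / pi if x1 * x1 + x2 * x2 <= 1 else 0.0

x1, x2 = 0.9, 0.9                 # inside the square [-1, 1]^2, outside the disc
product_value = marginal(x1) * marginal(x2)
joint_value = joint(x1, x2)
print(product_value, joint_value)
```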
Example 3.6.20. We suppose now that the dart does not hit a circle but some
rectangular set $R := [\alpha_1, \beta_1] \times [\alpha_2, \beta_2]$. Again we assume that the point $(x_1, x_2) \in R$ is uniformly
distributed on R. The posed question is the same as in Example 3.6.19, namely whether
$x_1$ and $x_2$ are independent of each other.

Answer: Define $X_1$ and $X_2$ as in the previous example. By assumption, the vector
$\vec{X} = (X_1, X_2)$ is uniformly distributed on R, hence its distribution density p is given by

$$
p(x_1, x_2) = \begin{cases} \frac{1}{\mathrm{vol}_2(R)} & : (x_1, x_2) \in R \\ 0 & : (x_1, x_2) \notin R . \end{cases}
$$

For the density $p_1$ of $X_1$ we get

$$
p_1(x_1) = \int_{-\infty}^{\infty} p(x_1, x_2)\, dx_2 = \frac{\beta_2 - \alpha_2}{\mathrm{vol}_2(R)} = \frac{1}{\beta_1 - \alpha_1}
$$

provided that $\alpha_1 \le x_1 \le \beta_1$. Otherwise, we have $p_1(x_1) = 0$. This tells us that $X_1$ is
uniformly distributed on $[\alpha_1, \beta_1]$. In the same way, we obtain for $x_2 \in [\alpha_2, \beta_2]$ that

$$
p_2(x_2) = \frac{1}{\beta_2 - \alpha_2}
$$

and $p_2(x_2) = 0$ otherwise. Hence, $X_2$ is also uniformly distributed, but this time on
$[\alpha_2, \beta_2]$. From the equations for $p_1$ and $p_2$ it follows that the joint density p satisfies

$$
p(x_1, x_2) = p_1(x_1) \cdot p_2(x_2), \quad (x_1, x_2) \in \mathbb{R}^2 .
$$

Consequently, by Proposition 3.6.18 the random variables $X_1$ and $X_2$ are independent,
and so are the coordinates $x_1$ and $x_2$ of the point hit.


Example 3.6.21. Let us look at Example 3.6.20 from the reversed side. Now we assume
that the coordinates are uniformly distributed, not the vector. Thus let $U_1, \ldots, U_n$ be
independent random variables with $U_j$ uniformly distributed on the interval $[\alpha_j, \beta_j]$,
$1 \le j \le n$. Then the random vector $\vec{U} = (U_1, \ldots, U_n)$ is (multivariate) uniformly
distributed on the box $K = [\alpha_1, \beta_1] \times \cdots \times [\alpha_n, \beta_n]$. This is an immediate consequence
of Example 1.9.13 combined with Proposition 3.6.5. A direct proof of this fact, without
using product measures, is as follows.

The density of $U_j$ is $p_j = \frac{1}{\beta_j - \alpha_j} \mathbb{1}_{[\alpha_j, \beta_j]}$, hence by Proposition 3.6.18 the joint density
p of $\vec{U}$ is given by

$$
p(x) = p_1(x_1) \cdots p_n(x_n) = \prod_{j=1}^{n} \frac{1}{\beta_j - \alpha_j} = \frac{1}{\mathrm{vol}_n(K)}, \quad x = (x_1, \ldots, x_n) \in K,
$$

and p(x) = 0 if $x \notin K$. Therefore,

$$
P\{\vec{U} \in B\} = \int_B p(x)\, dx = \frac{\mathrm{vol}_n(K \cap B)}{\mathrm{vol}_n(K)},
$$

and $\vec{U}$ is uniformly distributed on K as asserted.


Example 3.6.22. Let $X_1, \ldots, X_n$ be independent standard normally distributed.
Which joint density does the vector $\vec{X} = (X_1, \ldots, X_n)$ possess?
Answer: The densities $p_j$ of each of the $X_j$s are

$$
p_j(x) = \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}, \quad x \in \mathbb{R} .
$$

Consequently, by the independence of the $X_j$s the joint density p equals

$$
p(x) = p_1(x_1) \cdots p_n(x_n) = \frac{1}{(2\pi)^{n/2}}\, e^{-(x_1^2 + \cdots + x_n^2)/2} = \frac{1}{(2\pi)^{n/2}}\, e^{-|x|^2/2}, \quad x = (x_1, \ldots, x_n) .
$$

This tells us that $P_{\vec{X}} = \mathcal{N}(0, 1)^{\otimes n}$ (cf. Definition 1.9.16) or, equivalently, $\vec{X}$ is
n-dimensional standard normally distributed.


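Example 3.6.23. If $X_1, \ldots, X_n$ are independent $E_\lambda$-distributed, then

$$
P\{X_j \le t\} = \begin{cases} 0 & : t < 0 \\ 1 - e^{-\lambda t} & : t \ge 0 \end{cases}
$$

hence, the random vector $\vec{X} = (X_1, \ldots, X_n)$ has the joint density

$$
p(t) = \lambda^n e^{-\lambda (t_1 + \cdots + t_n)}, \quad t = (t_1, \ldots, t_n), \; t_j \ge 0,
$$

and p(t) = 0 if one of the $t_j$s is negative.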


3.7 *Order Statistics 


This section is devoted to a quite practical problem. Suppose we execute a random
experiment n times so that different trials are independent of each other. The results
of these trials are $x_1, \ldots, x_n$. For example, one may think of n different measurements
of the same item, and $x_1, \ldots, x_n$ are the observed values. After getting $x_1, \ldots, x_n$ we
reorder them by their size. These “new” numbers are denoted by $x_1^* \le \cdots \le x_n^*$. In
other words, the numbers are the same as before but in nondecreasing order. We now
ask for the distribution of the ordered $x_j^*$s.
The precise mathematical formulation of this problem is as follows: let $X_1, \ldots, X_n$
be n independent identically distributed random variables defined on a sample
space $\Omega$. For each fixed $\omega \in \Omega$, we choose a permutation $\pi_\omega \in S_n$ (we use the notation
of Section A.3.1), such that

$$
X_{\pi_\omega(1)}(\omega) \le \cdots \le X_{\pi_\omega(n)}(\omega) . \tag{3.23}
$$

Of course, it may happen that there exists more than one permutation for which the
inequalities (3.23) hold, namely if $X_i(\omega) = X_j(\omega)$ for some $i \ne j$. In this case, we choose
any of these permutations. Finally, for each $\omega \in \Omega$ we set

$$
X_1^*(\omega) := X_{\pi_\omega(1)}(\omega), \; \ldots, \; X_n^*(\omega) := X_{\pi_\omega(n)}(\omega) .
$$

In this way¹⁰ we obtain random variables $X_j^*$ satisfying $X_1^* \le \cdots \le X_n^*$. For example, it
holds

$$
X_1^* = \min\{X_1, \ldots, X_n\}, \; \ldots, \; X_n^* = \max\{X_1, \ldots, X_n\} .
$$


Remark 3.7.1. It is worthwhile to mention that the $X_j^*$s are in general neither independent nor
identically distributed.

Remark 3.7.2. For a better understanding of the procedure, let us look at the
case n = 3. There exist 6 = 3! possible ways the $X_j(\omega)$s may be ordered. For example,
if $X_2(\omega) \le X_3(\omega) \le X_1(\omega)$, then set $\pi_\omega(1) = 2$, $\pi_\omega(2) = 3$, and $\pi_\omega(3) = 1$ or, equivalently,
$\omega \in A_\pi$ where $\pi(1) = 2$, $\pi(2) = 3$, and $\pi(3) = 1$. Hence, in that example we have
$X_1^*(\omega) = X_2(\omega)$, $X_2^*(\omega) = X_3(\omega)$, and $X_3^*(\omega) = X_1(\omega)$. At the end, we get 6 subsets $A_\pi$ of
$\Omega$ where $\pi_\omega = \pi$ for a given $\pi \in S_3$, that is, on each of these six sets, the same type of
reordering is applied.

Definition 3.7.3. The ordered random variables $X_1^*, \ldots, X_n^*$ are called order
statistics of $X_1, \ldots, X_n$.

Remark 3.7.4. Order statistics play an important role in Mathematical Statistics. For
example, suppose at time t = 0 we switch on n light bulbs of the same type. Let us
record the times $0 \le t_1^* \le t_2^* \le \cdots \le t_n^*$ at which one of the n bulbs burns out. Then
these times are nothing else than the order statistics of the lifetimes $t_1, \ldots, t_n$ of the
first, second, and so on light bulb.

10 Another way to describe the procedure of reordering is as follows. For each permutation $\pi$ let
$A_\pi \subseteq \Omega$ be the set of those $\omega \in \Omega$ for which $X_{\pi(1)}(\omega) \le \cdots \le X_{\pi(n)}(\omega)$, that is, where $\pi = \pi_\omega$. Then
it follows $X_j^*(\omega) = X_{\pi(j)}(\omega)$ whenever $\omega \in A_\pi$. Note that there are at most n! different sets $A_\pi$.
Before we state and prove the main result of this section, let us recall that the $X_j$s
are assumed to be identically distributed. Consequently, all of them possess the same
distribution function F. That is, for all $j \le n$, we have

$$
F(t) = P\{X_j \le t\}, \quad t \in \mathbb{R} .
$$

Proposition 3.7.5. Let $X_1, \ldots, X_n$ be independent identically distributed random
variables with distribution function F. Then for each $k \le n$ we have

$$
P\{X_k^* \le t\} = \sum_{i=k}^{n} \binom{n}{i} F(t)^i (1 - F(t))^{n-i}, \quad t \in \mathbb{R} . \tag{3.24}
$$


Proof: Fix $t \in \mathbb{R}$. When does the event $\{X_k^* \le t\}$ occur? To answer this, for $i \le n$
introduce disjoint sets $A_i$ as follows: the event $A_i$ occurs if and only if exactly i of the
$X_j$s attain a value in $(-\infty, t]$. More precisely,

$$
A_i = \{\omega \in \Omega : \#\{j \le n : X_j(\omega) \le t\} = i\} .
$$

Next, observe that the event $\{X_k^* \le t\}$ occurs if and only if at least k of the $X_j$s attain a
value in $(-\infty, t]$. For example, it holds $X_1^* \le t$ if at least one of the $X_j$s is less than or
equal to t while we have $X_n^* \le t$ if $X_j \le t$ for all $j \le n$. Thus, by the definition of the $A_i$s
the event $\{X_k^* \le t\}$ coincides with $\bigcup_{i=k}^{n} A_i$. Consequently, since the $A_i$s are disjoint, it
follows that

$$
P\{X_k^* \le t\} = \sum_{i=k}^{n} P(A_i) . \tag{3.25}
$$

Let $Y_j := \mathbb{1}_{(-\infty, t]}(X_j)$. Then $Y_j = 1$ if and only if $X_j \le t$ while $Y_j = 0$ otherwise. Hence, the
$Y_j$s are binomial distributed with parameters 1 and p, where

$$
p = P\{Y_j = 1\} = P\{X_j \le t\} = F(t) .
$$

Since the $X_j$s are independent, so are the $Y_j$s and their sum¹¹ $Y_1 + \cdots + Y_n$ is binomial
distributed with parameters n and p = F(t). Note that the event $A_i$ occurs if and only if
$Y_1 + \cdots + Y_n = i$, which implies

$$
P(A_i) = P\{Y_1 + \cdots + Y_n = i\} = \binom{n}{i} p^i (1 - p)^{n-i} = \binom{n}{i} F(t)^i (1 - F(t))^{n-i} . \tag{3.26}
$$

Plugging eq. (3.26) into eq. (3.25) proves eq. (3.24). □


11 Here we already use a result, which will be proved later on in Proposition 4.6.1. 
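Example 3.7.6. Let us choose independently and according to the uniform distribution
n numbers $x_1, \ldots, x_n$ out of {1, ..., N}. Here, the same number may be chosen
more than one time. Given integers $m \le N$ and $k \le n$, find the probability that the kth
smallest number $x_k^*$ equals m.

Answer: The distribution function F of the uniform distribution on {1, ..., N}
satisfies

$$
F(m) = \frac{m}{N}, \quad m = 1, \ldots, N .
$$

Thus Proposition 3.7.5 implies

$$
P\{x_k^* \le m\} = \sum_{i=k}^{n} \binom{n}{i} \Big( \frac{m}{N} \Big)^i \Big( 1 - \frac{m}{N} \Big)^{n-i} .
$$

Because of $\{x_k^* = m\} = \{x_k^* \le m\} \setminus \{x_k^* \le m - 1\}$ we obtain

$$
\begin{aligned}
P\{x_k^* = m\} &= P\{x_k^* \le m\} - P\{x_k^* \le m - 1\} \\
&= \sum_{i=k}^{n} \binom{n}{i} \Big[ \Big( \frac{m}{N} \Big)^i \Big( 1 - \frac{m}{N} \Big)^{n-i} - \Big( \frac{m-1}{N} \Big)^i \Big( 1 - \frac{m-1}{N} \Big)^{n-i} \Big] .
\end{aligned}
$$

For example, roll a die four times and order the results in nondecreasing order as
$x_1^* \le \cdots \le x_4^*$. What is the probability that $x_3^*$ equals 5?

Answer: Let us apply the previous formula with N = 6, k = 3, and n = 4. For
m = 1, ..., 6 this implies

$$
P\{x_3^* = m\} = \sum_{i=3}^{4} \binom{4}{i} \Big[ \Big( \frac{m}{6} \Big)^i \Big( 1 - \frac{m}{6} \Big)^{4-i} - \Big( \frac{m-1}{6} \Big)^i \Big( 1 - \frac{m-1}{6} \Big)^{4-i} \Big] .
$$

The probabilities are

 m | P{x*_3 = m}
---+-------------
 1 | 0.0162037
 2 | 0.0949074
 3 | 0.201389
 4 | 0.280093
 5 | 0.275463
 6 | 0.131944

thus $P\{x_3^* = 5\} \approx 0.275463$, and $x_3^* = 4$ is most likely.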
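
The table of Example 3.7.6 can be reproduced from Proposition 3.7.5 and, independently, by listing all $6^4$ outcomes of the four rolls. The following sketch does both with exact fractions; the function names are chosen here for illustration.

```python
from fractions import Fraction
from itertools import product
from math import comb

def cdf_order_stat(k, n, F):
    """P{X*_k <= t} according to eq. (3.24), where F is the value F(t)."""
    return sum(comb(n, i) * F**i * (1 - F) ** (n - i) for i in range(k, n + 1))

def pmf_order_stat_uniform(k, n, N, m):
    """P{x*_k = m} for n independent uniform draws from {1, ..., N}."""
    return cdf_order_stat(k, n, Fraction(m, N)) - cdf_order_stat(k, n, Fraction(m - 1, N))

def brute_force(k, n, N, m):
    """Same probability by enumerating all N**n outcomes and sorting each one."""
    hits = sum(1 for xs in product(range(1, N + 1), repeat=n) if sorted(xs)[k - 1] == m)
    return Fraction(hits, N**n)

table = {m: pmf_order_stat_uniform(3, 4, 6, m) for m in range(1, 7)}
for m, prob in table.items():
    print(m, prob, float(prob))
```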
Let us now turn to the case of continuous random variables. That is, we assume that
the random variables $X_j$ possess a distribution density p satisfying

$$
P\{X_j \le t\} = \int_{-\infty}^{t} p(x)\, dx, \quad t \in \mathbb{R} .
$$

Again we remark that the preceding formula holds for all $j \le n$. Indeed, the $X_j$s
are identically distributed, hence they all have the same density. A natural question
arises: what distribution density does $X_k^*$ possess?

Proposition 3.7.7. Suppose p is the common density of the $X_j$s. Let $X_1^* \le \cdots \le X_n^*$ be
the order statistics of the $X_j$. Then the distribution density $p_k$ of $X_k^*$ is given by

$$
p_k(t) = \frac{n!}{(k-1)!\,(n-k)!}\; p(t)\, F(t)^{k-1} (1 - F(t))^{n-k} .
$$

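Proof: It holds that

$$
\begin{aligned}
p_k(t) &= \frac{d}{dt} P\{X_k^* \le t\} = \frac{d}{dt} \sum_{i=k}^{n} \binom{n}{i} F(t)^i (1 - F(t))^{n-i} \\
&= \sum_{i=k}^{n} i \binom{n}{i} p(t) F(t)^{i-1} (1 - F(t))^{n-i} - \sum_{i=k}^{n} (n - i) \binom{n}{i} p(t) F(t)^{i} (1 - F(t))^{n-i-1} . 
\end{aligned} \tag{3.27}
$$

In fact, the index i in the second sum of eq. (3.27) runs only from k to n − 1. Hence,
shifting it by 1, this sum becomes

$$
\sum_{i=k+1}^{n} (n - i + 1) \binom{n}{i-1} p(t) F(t)^{i-1} (1 - F(t))^{n-i} .
$$

Because of

$$
i \binom{n}{i} = \frac{n!}{(i-1)!\,(n-i)!} = (n - i + 1) \binom{n}{i-1}
$$

both sums in eq. (3.27) cancel out for i = k + 1, ..., n, and only the term for i = k remains.
Thus, we obtain

$$
p_k(t) = k \binom{n}{k} p(t) F(t)^{k-1} (1 - F(t))^{n-k} = \frac{n!}{(k-1)!\,(n-k)!}\; p(t)\, F(t)^{k-1} (1 - F(t))^{n-k}
$$

as asserted. □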
Example 3.7.8. Let us choose independently and according to the uniform distribution on $[0, 1]$ numbers $x_1, \ldots, x_n$. After reordering them, we get $0 \le x_1^* \le \cdots \le x_n^* \le 1$. Which distribution does $x_k^*$ possess?

Answer: The density $p$ of the uniform distribution on $[0, 1]$ is $\mathbb{1}_{[0,1]}$. Furthermore, its distribution function $F$ is given by $F(t) = t$ for $0 \le t \le 1$. Thus, by Proposition 3.7.7, the density $p_k$ coincides with

$$p_k(t) = \frac{n!}{(k-1)!\,(n-k)!}\, t^{k-1} (1-t)^{n-k}, \quad 0 \le t \le 1.$$

As already mentioned in Example 1.6.31, this is nothing else than the density of a beta distribution with parameters $k$ and $n - k + 1$. Hence, for all $k = 1, \ldots, n$ and all $0 \le a < b \le 1$, it follows that

$$P\{a \le x_k^* \le b\} = \mathcal{B}_{k,n-k+1}([a, b]) = \frac{n!}{(k-1)!\,(n-k)!} \int_a^b x^{k-1} (1-x)^{n-k}\,dx.$$

Example 3.7.9. Let us investigate here the example that was already mentioned in Remark 3.7.4. At time $t = 0$ we switch on $n$ electric bulbs of the same type. The times $0 \le t_1^* \le \cdots \le t_n^*$ are those at which we observe that one of the bulbs burns out. If we assume that the lifetime of each bulb is exponentially distributed, what can we say about the distribution of the $t_k^*$s?

Answer: Let $X_1, \ldots, X_n$ be the lifetimes of the $n$ light bulbs. By assumption, they are independent and all exponentially distributed with some parameter $\lambda > 0$. Then the distribution of $t_k^*$ is that of $X_k^*$. Furthermore, we have $p(t) = \lambda e^{-\lambda t}$ and $F(t) = 1 - e^{-\lambda t}$ for $t \ge 0$. By Proposition 3.7.7, the distribution density $p_k$ of $X_k^*$ equals

$$p_k(t) = \lambda\, \frac{n!}{(k-1)!\,(n-k)!}\, \big(1 - e^{-\lambda t}\big)^{k-1} e^{-\lambda t (n-k+1)}, \quad t \ge 0.$$

For example, for $t_1^*$, the time when we observe the first burnout of any of the bulbs, this implies

$$p_1(t) = \lambda n\, e^{-\lambda n t}, \quad t \ge 0,$$

that is, $t_1^*$ is $E_{\lambda n}$-distributed.
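As a plausibility check of the density in Example 3.7.9, one may evaluate it numerically. The sketch below is an illustration only: the helper names, the midpoint rule, and the parameter values ($n = 5$, $k = 3$, $\lambda = 0.5$) are assumptions made here, not part of the text.

```python
from math import exp, factorial

def order_stat_density_exp(t, k, n, lam):
    """Density of the k-th order statistic of n i.i.d. exponential lifetimes
    with parameter lam, following Proposition 3.7.7."""
    if t < 0:
        return 0.0
    c = factorial(n) / (factorial(k - 1) * factorial(n - k))
    return lam * c * (1 - exp(-lam * t))**(k - 1) * exp(-lam * t * (n - k + 1))

def integrate(f, a, b, steps=20_000):
    """Composite midpoint rule on [a, b]."""
    h = (b - a) / steps
    return h * sum(f(a + (j + 0.5) * h) for j in range(steps))

# The density of the third burnout time among five bulbs should integrate to about 1.
total = integrate(lambda t: order_stat_density_exp(t, k=3, n=5, lam=0.5), 0.0, 60.0)
print(total)

# For k = 1 the formula reduces to the exponential density with parameter lam * n.
print(order_stat_density_exp(0.3, 1, 5, 0.5), 0.5 * 5 * exp(-0.5 * 5 * 0.3))
```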
3.8 Problems 


Problem 3.1. The joint distribution of a random vector $X = (X_1, X_2)$ is described by


[Table of the joint distribution of $(X_1, X_2)$; its entries are not recoverable here.]




Define another vector $Y = (Y_1, Y_2)$ by $Y_1 := \min\{X_1, X_2\}$ and $Y_2 := \max\{X_1, X_2\}$. Find the probability distribution of $Y = (Y_1, Y_2)$. Are $Y_1$ and $Y_2$ independent?


Problem 3.2. Let $X = (X_1, X_2)$ be uniformly distributed on the square in $\mathbb{R}^2$ with corner points $(0, 1)$, $(1, 0)$, $(0, -1)$, and $(-1, 0)$. Find the marginal distributions of $X$.


Problem 3.3. In a lottery, six numbers are chosen out of $\{1, \ldots, 49\}$. As usual in lotteries, chosen numbers are not replaced. Let $X_1, \ldots, X_6$ be the chosen numbers as they appeared. That is, $X_1$ is the number chosen first while $X_6$ is the number that appeared last.

1. Determine the joint distribution of the vector $X = (X_1, \ldots, X_6)$, as well as its marginal distributions.

2. Argue why $X_1, \ldots, X_6$ are not independent.

3. Reordering the six chosen numbers leads to the order statistics $X_1^* < \cdots < X_6^*$. Find the joint distribution of the vector $(X_1^*, \ldots, X_6^*)$, as well as its marginal distributions.


Problem 3.4. A random variable X is geometrically distributed. Given natural numbers k
and n, show that 


P{X =k+n|X > n} = P{X =k}. 


Why is this property called “lack of memory property”? 


Problem 3.5. A random variable is exponentially distributed. Prove 


$$P\{X > s + t \mid X > s\} = P\{X > t\}$$

for all $t, s \ge 0$.


Problem 3.6. Two random variables X and Y are independent and geometrically 
distributed with parameters p and q for some 0 < p,q < 1. Evaluate P{X < Y < 2X}. 


Problem 3.7. Suppose two independent random variables X and Y satisfy 


$$P\{X = k\} = P\{Y = k\} = \frac{1}{2^k}, \quad k = 1, 2, \ldots$$


Find the probabilities P{X < Y} and P{X = Y}. 
Problem 3.8. Choose two numbers $b$ and $c$ independently, $b$ according to the uniform distribution on $[-1, 1]$ and $c$ according to the uniform distribution on $[0, 1]$. Find the probability that the equation

$$x^2 + bx + c = 0$$

does not possess a real solution.


Problem 3.9. Use Problem 1.31 to prove the following: If $X$ is a random variable, then the number of points $t \in \mathbb{R}$ with $P\{X = t\} > 0$ is at most countably infinite.


Problem 3.10. Suppose a fair coin is labeled with “0” and “1.” Toss the coin n times.
Let X be the maximum observed value and Y the sum of the n values. Determine the 
joint distribution of (X, Y). Argue that X and Y are not independent. 


Problem 3.11. Suppose a random vector (X,Y) has the joint density function p 
defined by 


p(u,v) = eeu u+v<l 
0: otherwise 
1. Find the value of the constant $c$ so that $p$ becomes a density function.
2. Determine the density functions of $X$ and $Y$.
3. Evaluate $P\{X + Y \le 1/2\}$.
4. Are $X$ and $Y$ independent?


Problem 3.12. Gambler A has a biased coin with “head” having probability $p$ for some $0 < p < 1$, and gambler B’s coin is biased with “head” having probability $q$ for some $0 < q < 1$. A and B toss their coins simultaneously. Whoever lands on “head” first wins. If both gamblers observe “head” at the same time, then the game ends in a draw. Evaluate the probability that A wins and the probability that the game ends in a draw.


Problem 3.13. Randomly choose two integers $x_1$ and $x_2$ from 1 to 10. Let $X$ be the minimum of $x_1$ and $x_2$. Determine the distribution and the probability mass function of $X$ in the following two cases:

- The number chosen first is replaced.

- The first number is not replaced.


Evaluate in both cases $P\{2 \le X \le 3\}$ and $P\{X > 8\}$.


Problem 3.14. An urn contains four balls labeled “0” and three balls labeled “2.” Choose three balls without replacement. Let $X$ be the sum of the values on the three chosen balls. Find the distribution of $X$.


4 Operations on Random Variables 


4.1 Mappings of Random Variables 


This section is devoted to the following problem: let $X : \Omega \to \mathbb{R}$ be a random variable and let $f : \mathbb{R} \to \mathbb{R}$ be some function. Set $Y := f(X)$, that is, for all $\omega \in \Omega$ we have $Y(\omega) = f(X(\omega))$. Suppose the distribution of $X$ is known. Then the following question arises:


Which distribution does $Y = f(X)$ possess?


For example, if $f(t) = t^2$, and we know the distribution of $X$, then we ask for the probability distribution of $X^2$. Is it possible to compute this by easy methods?

At the moment it is not clear at all whether $Y = f(X)$ is a random variable. Only if this is the case is the probability distribution $P_Y$ well-defined. For arbitrary functions $f$ this need not be true; they have to satisfy the following additional property.


Definition 4.1.1. A function $f : \mathbb{R} \to \mathbb{R}$ is called measurable if for each $B \in \mathcal{B}(\mathbb{R})$ the preimage $f^{-1}(B)$ is a Borel set as well.


Remark 4.1.2. This is a purely technical condition for f, which will not play an im- 
portant role later on. All functions of interest, for example, piecewise continuous, 
monotone, pointwise limits of continuous functions, and so on, are measurable. 


The measurability of f is needed to prove the following result. 


Proposition 4.1.3. Let $X : \Omega \to \mathbb{R}$ be a random variable. If $f : \mathbb{R} \to \mathbb{R}$ is measurable, then $Y = f(X)$ is a random variable as well.


Proof: Take a Borel set $B \in \mathcal{B}(\mathbb{R})$. Then

$$Y^{-1}(B) = X^{-1}\big(f^{-1}(B)\big) = X^{-1}(B')$$

with $B' := f^{-1}(B)$. We assumed $f$ to be measurable, which implies $B' \in \mathcal{B}(\mathbb{R})$, and hence, since $X$ is a random variable, we conclude $Y^{-1}(B) = X^{-1}(B') \in \mathcal{A}$. The Borel set $B$ was arbitrary, thus, as asserted, $Y$ is a random variable. □


There does not exist a general method for the description of $P_Y$ in terms of $P_X$. Only for some special functions, for example, for linear functions or for strictly monotone and differentiable ones, do there exist general rules for the computation of $P_Y$. Nevertheless, quite often we are able to determine $P_Y$ directly. Mostly the following two approaches turn out to be helpful.
If $X$ is discrete with values in $D := \{x_1, x_2, \ldots\}$, then $Y = f(X)$ maps the sample space $\Omega$ into $f(D) = \{f(x_1), f(x_2), \ldots\}$. Problems arise if $f$ is not one-to-one. In this case one has to combine those $x_j$s that are mapped onto the same element in $f(D)$. For example, if $X$ is uniformly distributed on $D = \{-2, -1, 0, 1, 2\}$ and $f(x) = x^2$, then $Y = X^2$ has values in $f(D) = \{0, 1, 4\}$. Combining $-1$ and $1$ and $-2$ and $2$ leads to

$$P\{Y = 0\} = P\{X = 0\} = \frac{1}{5}, \qquad P\{Y = 1\} = P\{X = -1\} + P\{X = 1\} = \frac{2}{5},$$

$$P\{Y = 4\} = P\{X = -2\} + P\{X = 2\} = \frac{2}{5}.$$

The case of one-to-one functions $f$ is easier to handle because then

$$P\{Y = f(x_j)\} = P\{X = x_j\}, \quad j = 1, 2, \ldots,$$

and the distribution of $Y$ can be directly computed from that of $X$.
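The rule for discrete $X$, combining all values with the same image under $f$, can be sketched in a few lines. This is a minimal illustration only; the name `pushforward` and the dictionary representation of a distribution are choices made here.

```python
from collections import defaultdict
from fractions import Fraction

def pushforward(pmf, f):
    """Distribution of Y = f(X) for a discrete X given as {value: probability}."""
    out = defaultdict(Fraction)
    for x, p in pmf.items():
        out[f(x)] += p  # values x with the same image f(x) are combined
    return dict(out)

# X uniformly distributed on {-2, -1, 0, 1, 2} and f(x) = x^2.
pmf_x = {x: Fraction(1, 5) for x in (-2, -1, 0, 1, 2)}
pmf_y = pushforward(pmf_x, lambda x: x * x)
print(pmf_y)
```

Exact fractions are used so that the resulting probabilities 1/5, 2/5, and 2/5 can be compared without rounding.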
For continuous $X$ one tries to determine the distribution function $F_Y$ of $Y$. Recall that this was defined as

$$F_Y(t) = P\{Y \le t\} = P\{f(X) \le t\}.$$

If we are able to compute $F_Y$, then we are almost done because then we get the distribution density $q$ of $Y$ as derivative of $F_Y$.
For instance, if $f$ is increasing, we get $F_Y$ easily by

$$F_Y(t) = P\{X \le f^{-1}(t)\} = F_X\big(f^{-1}(t)\big)$$

with inverse function $f^{-1}$ (cf. Problem 4.15).
The following examples demonstrate how we compute the distribution of $f(X)$ in some special cases.


Example 4.1.4. Assume the random variable $X$ is $\mathcal{N}(0, 1)$-distributed. Which distribution does $Y := X^2$ possess?

Answer: Of course, $F_Y(t) = P\{Y \le t\} = 0$ when $t \le 0$. Consequently, it suffices to determine $F_Y(t)$ for $t > 0$. Then

$$F_Y(t) = P\{X^2 \le t\} = P\{-\sqrt{t} \le X \le \sqrt{t}\} = \frac{1}{\sqrt{2\pi}} \int_{-\sqrt{t}}^{\sqrt{t}} e^{-s^2/2}\,ds = \frac{2}{\sqrt{2\pi}} \int_{0}^{\sqrt{t}} e^{-s^2/2}\,ds = h(\sqrt{t}),$$

where

$$h(u) := \frac{\sqrt{2}}{\sqrt{\pi}} \int_0^u e^{-s^2/2}\,ds, \quad u \ge 0.$$

Differentiating $F_Y$ with respect to $t$, the chain rule and the fundamental theorem of Calculus lead to

$$q(t) = F_Y'(t) = \Big(\frac{d}{dt}\sqrt{t}\Big)\, h'(\sqrt{t}) = \frac{t^{-1/2}}{2} \cdot \frac{\sqrt{2}}{\sqrt{\pi}}\, e^{-t/2} = \frac{1}{2^{1/2}\,\Gamma(1/2)}\, t^{\frac{1}{2} - 1} e^{-t/2}, \quad t > 0.$$

Hereby, in the last step, we used $\Gamma(1/2) = \sqrt{\pi}$. Consequently, $Y$ possesses the density function

$$q(t) = \begin{cases} 0 & : t \le 0 \\ \dfrac{1}{2^{1/2}\,\Gamma(1/2)}\, t^{-1/2} e^{-t/2} & : t > 0. \end{cases}$$

But this is the density of a $\Gamma_{2,\frac{1}{2}}$-distribution. Therefore, we obtained the following result, which we, because of its importance, state as a separate proposition.


Proposition 4.1.5. If $X$ is $\mathcal{N}(0, 1)$-distributed, then $X^2$ is $\Gamma_{2,\frac{1}{2}}$-distributed or, equivalently, distributed according to $\chi_1^2$.
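The computation in Example 4.1.4 can be checked numerically. The sketch below assumes the standard identity $P\{-u \le X \le u\} = \operatorname{erf}(u/\sqrt{2})$ for standard normal $X$; the function names are chosen here for illustration. It compares a numerical derivative of $F_Y$ with the density $q$.

```python
from math import erf, exp, pi, sqrt

def cdf_y(t):
    """F_Y(t) = P{X^2 <= t} = P{-sqrt(t) <= X <= sqrt(t)} for standard normal X."""
    return erf(sqrt(t / 2)) if t > 0 else 0.0

def density_y(t):
    """q(t) = t^(-1/2) e^(-t/2) / sqrt(2 pi) for t > 0, and 0 otherwise."""
    return exp(-t / 2) / sqrt(2 * pi * t) if t > 0 else 0.0

def numeric_derivative(F, t, h=1e-5):
    """Central difference quotient as an approximation of F'(t)."""
    return (F(t + h) - F(t - h)) / (2 * h)

for t in (0.5, 1.0, 2.5):
    print(t, numeric_derivative(cdf_y, t), density_y(t))
```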


Example 4.1.6. Let $U$ be uniformly distributed on $[0, 1]$. Which distribution does $Y := 1/U$ possess?

Answer: Again we determine $F_Y$. From $P\{U \in (0, 1]\} = 1$ we derive $P\{Y \ge 1\} = 1$, thus, $F_Y(t) = 0$ if $t < 1$. Therefore, we only have to regard numbers $t \ge 1$. Here we have

$$F_Y(t) = P\Big\{\frac{1}{U} \le t\Big\} = P\Big\{U \ge \frac{1}{t}\Big\} = 1 - \frac{1}{t}.$$

Hence, the density function $q$ of $Y$ is given by

$$q(t) = F_Y'(t) = \begin{cases} 0 & : t < 1 \\ \dfrac{1}{t^2} & : t \ge 1. \end{cases}$$

Example 4.1.7 (Random walk on $\mathbb{Z}$). A particle is located at the point 0 of $\mathbb{Z}$. In a first step it moves either to $-1$ or to $+1$. In the second step it jumps, independently of the first move, again to the left or to the right. Thus, after two steps it is located either at $-2$, $0$, or $2$. Hereby we assume that $p$ is the probability for jumps to the right, hence $1 - p$ for jumps to the left. This procedure is repeated arbitrarily often. Let $S_n$ be the position of the particle after $n$ jumps or, equivalently, after $n$ steps.¹ The (random) sequence $(S_n)_{n \ge 0}$ is called a (next-neighbor) random walk on $\mathbb{Z}$, where by the construction $P\{S_0 = 0\} = 1$.


1 $S_n$ can also be viewed as the loss or win after $n$ games, where $p$ is the probability to win one dollar in a single game, while $1 - p$ is the probability to lose one dollar.
After $n$ steps the possible positions of the particle are in

$$D_n = \{-n, -n+2, \ldots, n-2, n\}.$$

In other words, $S_n$ is a random variable with values in $D_n$. Which distribution does $S_n$ possess? To answer this question define

$$Y_n := \frac{1}{2}(S_n + n).$$

The random variable $Y_n$ attains values in $\{0, 1, \ldots, n\}$ and, moreover, $Y_n = m$ if the position of the particle after $n$ steps is $2m - n$, that is, if it jumped $m$ times to the right and $n - m$ times to the left. To see this, take $m = 0$, hence $S_n = -n$, which can only be achieved if all jumps were to the left. If $m = 1$, then $S_n = -n + 2$, that is, there were $n - 1$ jumps to the left and 1 to the right. The same argument applies for all $m \le n$.

This observation tells us that $Y_n$ is $B_{n,p}$-distributed, that is,

$$P\{Y_n = m\} = \binom{n}{m} p^m (1-p)^{n-m}, \quad m = 0, \ldots, n.$$

Since $Y_n = \frac{1}{2}(S_n + n)$, if $k \in D_n$, then it follows that

$$P\{S_n = k\} = P\Big\{Y_n = \frac{1}{2}(k + n)\Big\} = \binom{n}{\frac{n+k}{2}} p^{(n+k)/2} (1-p)^{(n-k)/2}. \tag{4.1}$$

For even $n$ we have $0 \in D_n$, thus one may ask for the probability of $S_n = 0$, that is, for the probability that the particle returns to its starting point after $n$ steps. Applying (4.1) with $k = 0$ gives

$$P\{S_n = 0\} = \binom{n}{\frac{n}{2}} p^{n/2} (1-p)^{n/2},$$

hence, if $p = 1/2$, then

$$P\{S_n = 0\} = \binom{n}{\frac{n}{2}} 2^{-n} = \frac{n!}{((n/2)!)^2}\, 2^{-n}.$$

An application of Stirling’s formula (1.51) implies

$$\lim_{n \to \infty} n^{1/2}\, P\{S_n = 0\} = \lim_{n \to \infty} n^{1/2}\, \frac{\sqrt{2\pi n}\, (n/e)^n\, 2^{-n}}{\big[\sqrt{\pi n}\, (n/2e)^{n/2}\big]^2} = \sqrt{\frac{2}{\pi}},$$

that is, if $n \to \infty$ (along even $n$), then $P\{S_n = 0\} \sim \big(\frac{2}{\pi n}\big)^{1/2}$.
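Formula (4.1) and the asymptotic behavior of the return probability can be illustrated numerically. The following is a sketch only; the function names are assumptions made here, and the comparison with $(2/(\pi n))^{1/2}$ is restricted to even $n$.

```python
from math import comb, pi, sqrt

def random_walk_pmf(n, k, p):
    """P{S_n = k} according to eq. (4.1); zero whenever k is not in D_n."""
    if abs(k) > n or (n + k) % 2 != 0:
        return 0.0
    m = (n + k) // 2  # number of jumps to the right
    return comb(n, m) * p**m * (1 - p)**(n - m)

def return_probability(n):
    """Exact P{S_n = 0} for p = 1/2 and even n, using integer arithmetic."""
    return comb(n, n // 2) / 2**n

for n in (10, 100, 1000):
    exact = return_probability(n)
    approx = sqrt(2 / (pi * n))
    print(n, exact, approx, exact / approx)
```

The printed ratio approaches 1 as $n$ grows, which is what the limit statement asserts.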
Example 4.1.8. Suppose $X$ is $B^-_{n,p}$-distributed, that is,

$$P\{X = k\} = \binom{k-1}{k-n} p^n (1-p)^{k-n}, \quad k = n, n+1, \ldots.$$

Let $Y = X - n$. Which probability distribution does $Y$ possess?
Answer: An easy transformation (cf. formula (1.34)) leads to

$$P\{Y = k\} = P\{X = k + n\} = \binom{n+k-1}{k} p^n (1-p)^k = \binom{-n}{k} p^n (p-1)^k \tag{4.2}$$

for all $k = 0, 1, \ldots$

Additional question: Which random experiment does $Y$ describe?

Answer: We perform a series of random trials where each time we may obtain either failure or success. Hereby, the success probability is $p \in (0, 1)$. Then the event $\{Y = k\}$ occurs if and only if we observe the $n$th success in trial $k + n$.


We conclude this section with the investigation of the following problem. Suppose $X_1, \ldots, X_n$ are independent random variables. Given $n$ measurable functions $f_1, \ldots, f_n$ from $\mathbb{R}$ to $\mathbb{R}$, we define “new” random variables $Y_1, \ldots, Y_n$ by

$$Y_i := f_i(X_i), \quad 1 \le i \le n.$$

It is intuitively clear that then $Y_1, \ldots, Y_n$ are also independent; the values of $Y_i$ only depend on those of $X_i$, thus the independence should be preserved. For example, if $X_1$ and $X_2$ are independent, then this should also be valid for $X_1^2$ and $2X_2$.

The next result shows that this is indeed true.


Proposition 4.1.9. Let $X_1, \ldots, X_n$ be independent random variables and let $(f_i)_{i=1}^n$ be measurable functions from $\mathbb{R}$ to $\mathbb{R}$. Then $f_1(X_1), \ldots, f_n(X_n)$ are independent as well.


Proof: Choose arbitrary Borel sets $B_1, \ldots, B_n$ in $\mathbb{R}$ and set $A_i := f_i^{-1}(B_i)$, $1 \le i \le n$. With this notation, an $\omega \in \Omega$ satisfies $f_i(X_i(\omega)) \in B_i$ if and only if $X_i(\omega) \in A_i$. Hence, an application of the independence of the $X_i$ (use 3.14 with the $X_i$s and the $A_i$s) leads to

$$P\{f_1(X_1) \in B_1, \ldots, f_n(X_n) \in B_n\} = P\{X_1 \in A_1, \ldots, X_n \in A_n\}$$
$$= P\{X_1 \in A_1\} \cdots P\{X_n \in A_n\} = P\{f_1(X_1) \in B_1\} \cdots P\{f_n(X_n) \in B_n\}.$$

The $B_i$s were chosen arbitrarily, thus the random variables $f_1(X_1), \ldots, f_n(X_n)$ are independent as well. □


Remark 4.1.10. Without proof we still mention that the independence of random variables is preserved whenever they are put together into disjoint groups. For example, if $X_1, \ldots, X_n$ are independent, then so are $f(X_1, \ldots, X_k)$ and $g(X_{k+1}, \ldots, X_n)$ for suitable functions $f$ and $g$. Assume we roll a die five times and let $X_1, \ldots, X_5$ be the results. Then these random variables are independent, but so are also the two random variables $\max\{X_1, X_2\}$ and $X_3 + X_4 + X_5$ or the three $X_1$, $\max\{X_2, X_3\}$ and $\min\{X_4, X_5\}$.


4.2 Linear Transformations 


Let $a$ and $b$ be real numbers with $a \ne 0$. Given a random variable $X$, set $Y := aX + b$, that is, $Y$ arises from $X$ by a linear transformation. We ask now for the probability distribution of $Y$.


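Proposition 4.2.1. Define $Y = aX + b$ with $a, b \in \mathbb{R}$ and $a \ne 0$.


(a) Depending on whether $a > 0$ or $a < 0$,

$$F_Y(t) = F_X\Big(\frac{t-b}{a}\Big) \quad \text{or} \quad F_Y(t) = 1 - P\Big\{X < \frac{t-b}{a}\Big\}.$$

If $a < 0$ and $F_X$ is continuous at $\frac{t-b}{a}$, then $F_Y(t) = 1 - F_X\big(\frac{t-b}{a}\big)$.
(b) Let $X$ be a continuous random variable with density $p$. Then $Y$ is also continuous with density $q$ given by

$$q(t) = \frac{1}{|a|}\, p\Big(\frac{t-b}{a}\Big), \quad t \in \mathbb{R}. \tag{4.3}$$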


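Proof: Let us first treat the case $a > 0$. Then we get

$$F_Y(t) = P\{aX + b \le t\} = P\Big\{X \le \frac{t-b}{a}\Big\} = F_X\Big(\frac{t-b}{a}\Big)$$

as asserted.
In the case $a < 0$ we conclude as follows:

$$F_Y(t) = P\{aX + b \le t\} = P\Big\{X \ge \frac{t-b}{a}\Big\} = 1 - P\Big\{X < \frac{t-b}{a}\Big\}.$$

If $F_X$ is continuous at $\frac{t-b}{a}$, then $P\big\{X = \frac{t-b}{a}\big\} = 0$, hence

$$1 - P\Big\{X < \frac{t-b}{a}\Big\} = 1 - P\Big\{X \le \frac{t-b}{a}\Big\} = 1 - F_X\Big(\frac{t-b}{a}\Big),$$

completing the proof of part (a).
Suppose now that $p$ is a density function of $X$, that is,

$$F_X(t) = P\{X \le t\} = \int_{-\infty}^{t} p(x)\,dx, \quad t \in \mathbb{R}.$$

If $a > 0$, by part (a) and the change of variables $x = \frac{y-b}{a}$, we get

$$F_Y(t) = F_X\Big(\frac{t-b}{a}\Big) = \int_{-\infty}^{\frac{t-b}{a}} p(x)\,dx = \int_{-\infty}^{t} \frac{1}{a}\, p\Big(\frac{y-b}{a}\Big)\,dy = \int_{-\infty}^{t} q(y)\,dy.$$

Thus, $q$ is a density of $Y$.
If $a < 0$, the same change of variables² leads to

$$F_Y(t) = 1 - F_X\Big(\frac{t-b}{a}\Big) = \int_{\frac{t-b}{a}}^{\infty} p(x)\,dx = \int_{t}^{-\infty} \frac{1}{a}\, p\Big(\frac{y-b}{a}\Big)\,dy$$

$$= \int_{-\infty}^{t} \frac{1}{-a}\, p\Big(\frac{y-b}{a}\Big)\,dy = \int_{-\infty}^{t} \frac{1}{|a|}\, p\Big(\frac{y-b}{a}\Big)\,dy = \int_{-\infty}^{t} q(y)\,dy.$$

This being true for all $t \in \mathbb{R}$ completes the proof. □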


Example 4.2.2. Let $X$ be $\mathcal{N}(0, 1)$-distributed. Given $a \ne 0$ and $\mu \in \mathbb{R}$, we ask for the distribution of $Y := aX + \mu$.
Answer: The random variable $X$ is assumed to be continuous with density

$$p(t) = \frac{1}{\sqrt{2\pi}}\, e^{-t^2/2}.$$

We apply eq. (4.3) with $b = \mu$ to deduce that the density $q$ of $Y$ equals

$$q(t) = \frac{1}{|a|}\, p\Big(\frac{t-\mu}{a}\Big) = \frac{1}{\sqrt{2\pi}\,|a|}\, e^{-(t-\mu)^2/(2a^2)}.$$

That is, the random variable $Y$ is $\mathcal{N}(\mu, |a|^2)$-distributed. In particular, if $\sigma > 0$ and $\mu \in \mathbb{R}$, then $\sigma X + \mu$ is distributed according to $\mathcal{N}(\mu, \sigma^2)$.

Additional question: Suppose $Y$ is $\mathcal{N}(\mu, \sigma^2)$-distributed. Which probability distribution does $X := \frac{Y - \mu}{\sigma}$ possess?

Answer: Formula (4.3) immediately implies that $X$ is standard normally distributed.
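Formula (4.3) translates directly into code. The sketch below, with function names chosen here for illustration, builds the density of $aX + b$ from a given density $p$ and compares it, for a negative $a$, with the normal density with expected value $b$ and variance $a^2$.

```python
from math import exp, pi, sqrt

def normal_density(t, mu=0.0, sigma=1.0):
    """Density of the normal distribution with parameters mu and sigma^2."""
    return exp(-(t - mu)**2 / (2 * sigma**2)) / (sqrt(2 * pi) * sigma)

def transformed_density(p, a, b):
    """Density q of Y = aX + b by eq. (4.3): q(t) = p((t - b)/a) / |a|."""
    if a == 0:
        raise ValueError("a must be nonzero")
    return lambda t: p((t - b) / a) / abs(a)

# X standard normal, Y = -2X + 1.5; by Example 4.2.2 the density q belongs to N(1.5, 4).
q = transformed_density(normal_density, a=-2.0, b=1.5)
for t in (-3.0, 0.0, 1.5, 4.0):
    print(t, q(t), normal_density(t, mu=1.5, sigma=2.0))
```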


Because of the importance of the previous observation, we formulate it as a proposition.

Proposition 4.2.3. Suppose $\mu \in \mathbb{R}$ and $\sigma > 0$. Then the following are equivalent:

$$X \text{ is } \mathcal{N}(0,1)\text{-distributed} \iff \sigma X + \mu \text{ is distributed according to } \mathcal{N}(\mu, \sigma^2).$$


2 Observe that now $a < 0$, hence the order of integration changes and a minus sign appears.
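Corollary 4.2.4. Let $\Phi$ be the Gaussian $\Phi$-function introduced in eq. (1.62). For each interval $[a, b]$,

$$\mathcal{N}(\mu, \sigma^2)([a, b]) = \Phi\Big(\frac{b - \mu}{\sigma}\Big) - \Phi\Big(\frac{a - \mu}{\sigma}\Big).$$

Proof: This is a direct consequence of Proposition 4.2.3. Indeed, if $X$ is standard normally distributed, then

$$\mathcal{N}(\mu, \sigma^2)([a, b]) = P\{a \le \sigma X + \mu \le b\} = P\Big\{\frac{a - \mu}{\sigma} \le X \le \frac{b - \mu}{\sigma}\Big\} = \Phi\Big(\frac{b - \mu}{\sigma}\Big) - \Phi\Big(\frac{a - \mu}{\sigma}\Big)$$

as asserted. □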


Let $X$ be an $\mathcal{N}(\mu, \sigma^2)$-distributed random variable. The next result shows that $X$ with high probability (more than 99.7%) attains values in $[\mu - 3\sigma, \mu + 3\sigma]$. Therefore, in most cases, one may assume that $X$ maps into $[\mu - 3\sigma, \mu + 3\sigma]$. This observation is usually called the $3\sigma$-rule.


Corollary 4.2.5 ($3\sigma$-rule). If $X$ is distributed according to $\mathcal{N}(\mu, \sigma^2)$, then

$$P\{|X - \mu| \le 2\sigma\} \ge 0.954 \quad \text{and} \quad P\{|X - \mu| \le 3\sigma\} \ge 0.997.$$

Proof: By virtue of Corollary 4.2.4, for each $c > 0$

$$P\{|X - \mu| \le c\sigma\} = \Phi(c) - \Phi(-c),$$

hence the desired estimates follow by

$$\Phi(2) - \Phi(-2) = 0.9545 \quad \text{and} \quad \Phi(3) - \Phi(-3) = 0.9973.$$
□
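The numerical values in the proof can be recomputed with the error function. The sketch below assumes the standard identity $\Phi(x) = \frac{1}{2}\big(1 + \operatorname{erf}(x/\sqrt{2})\big)$; the function names are chosen here for illustration.

```python
from math import erf, sqrt

def Phi(x):
    """Gaussian Phi-function, the distribution function of N(0, 1)."""
    return 0.5 * (1 + erf(x / sqrt(2)))

def normal_interval_prob(a, b, mu, sigma):
    """N(mu, sigma^2)([a, b]) as in Corollary 4.2.4."""
    return Phi((b - mu) / sigma) - Phi((a - mu) / sigma)

for c in (1, 2, 3):
    # The result does not depend on mu and sigma; here mu = 10 and sigma = 4.
    print(c, normal_interval_prob(10 - c * 4, 10 + c * 4, mu=10, sigma=4))
```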


Example 4.2.6. Let $U$ be uniformly distributed on $[0, 1]$. What is the probability distribution of $aU + b$ if $a \ne 0$ and $b \in \mathbb{R}$?

Answer: The distribution density $p$ of $U$ is given by $p(t) = 1$ if $0 \le t \le 1$ and $p(t) = 0$ otherwise. Therefore, the density $q$ of $aU + b$ equals

$$q(t) = \begin{cases} \dfrac{1}{|a|} & : 0 \le \dfrac{t-b}{a} \le 1 \\ 0 & : \text{otherwise.} \end{cases}$$

Assume first $a > 0$. Then $q(t) = 1/a$ if and only if $b \le t \le a + b$ and $q(t) = 0$ otherwise. Consequently, $aU + b$ is uniformly distributed on $[b, a + b]$.
If, in contrast, $a < 0$, then $q(t) = 1/|a|$ if and only if $a + b \le t \le b$ and $q(t) = 0$ otherwise. Hence, now $aU + b$ is uniformly distributed on $[a + b, b]$.
It is easy to see that the reversed implications are also true. That is, we have

$$U \text{ unif. distr. on } [0, 1] \iff aU + b \text{ unif. distr. on } \begin{cases} [b, a+b] & : a > 0 \\ [a+b, b] & : a < 0. \end{cases}$$

Corollary 4.2.7. A random variable $X$ is uniformly distributed on $[0, 1]$ if and only if $1 - X$ is so. In particular, if $U$ is uniformly distributed on $[0, 1]$, then $U \stackrel{d}{=} 1 - U$.


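Example 4.2.8. Suppose a random variable $X$ is $\Gamma_{\alpha,\beta}$-distributed for some $\alpha, \beta > 0$ and let $a > 0$. Which distribution does $aX$ possess?
Answer: The distribution density $p$ of $X$ satisfies $p(t) = 0$ if $t \le 0$ and, if $t > 0$, then

$$p(t) = \frac{1}{\alpha^{\beta}\, \Gamma(\beta)}\, t^{\beta - 1} e^{-t/\alpha}.$$

An application of eq. (4.3) implies that the density $q$ of $aX$ is given by $q(t) = 0$ if $t \le 0$ and, if $t > 0$, then

$$q(t) = \frac{1}{a}\, p\Big(\frac{t}{a}\Big) = \frac{1}{a\, \alpha^{\beta}\, \Gamma(\beta)} \Big(\frac{t}{a}\Big)^{\beta - 1} e^{-t/a\alpha} = \frac{1}{(a\alpha)^{\beta}\, \Gamma(\beta)}\, t^{\beta - 1} e^{-t/a\alpha}.$$

Thus, $aX$ is $\Gamma_{a\alpha,\beta}$-distributed.

In the case of the exponential distribution $E_\lambda = \Gamma_{\lambda^{-1},1}$, the previous result implies the following: if $a > 0$, then a random variable $X$ is $E_\lambda$-distributed if and only if $aX$ possesses an $E_{\lambda/a}$ distribution.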


4.3 Coin Tossing versus Uniform Distribution 
4.3.1 Binary Fractions 
We start this section with the following statement: each real number $x \in [0, 1)$ may be represented as binary fraction $x = 0.x_1 x_2 \cdots$, where $x_k \in \{0, 1\}$. This is a shortened way to express that

$$x = \sum_{k=1}^{\infty} \frac{x_k}{2^k}.$$

The representation of $x$ as binary fraction is in general not unique. For example,

$$\frac{1}{2} = 0.10000\cdots, \quad \text{but also} \quad \frac{1}{2} = 0.01111\cdots.$$
Check this by computing the infinite sums in both cases.

It is not difficult to prove that exactly those $x \in [0, 1)$ admit two different representations, which may be written as $x = k/2^n$ for some $n \in \mathbb{N}$ and some $k = 1, 3, 5, \ldots, 2^n - 1$.

To make the binary representation unique we declare the following:


Convention 4.1. If a number $x \in [0, 1)$ admits the representations

$$x = 0.x_1 \cdots x_{n-1} 1000\cdots \quad \text{and} \quad x = 0.x_1 \cdots x_{n-1} 0111\cdots,$$

then we always choose the former one. In other words, there do not exist numbers $x \in [0, 1)$ whose binary representation consists only of 1s from a certain point on.


How do we get the binary fraction for a given $x \in [0, 1)$?

The procedure is not difficult. First, one checks whether $x < \frac{1}{2}$ or $x \ge \frac{1}{2}$. In the former case one takes $x_1 = 0$ and in the latter $x_1 = 1$.

By this choice it follows that $0 \le x - \frac{x_1}{2} < \frac{1}{2}$. In the next step one asks whether $x - \frac{x_1}{2} < \frac{1}{4}$ or $x - \frac{x_1}{2} \ge \frac{1}{4}$. Depending on this one chooses either $x_2 = 0$ or $x_2 = 1$. This choice implies $0 \le x - \frac{x_1}{2} - \frac{x_2}{4} < \frac{1}{4}$, and if this difference belongs either to $[0, \frac{1}{8})$ or to $[\frac{1}{8}, \frac{1}{4})$, then $x_3 = 0$ or $x_3 = 1$, respectively. Proceeding further in that way leads to the binary fraction representing $x$.
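The halving procedure just described can be written as a short loop. The following sketch is an illustration only; the function name `binary_digits` is chosen here, and exact fractions are used to avoid rounding effects.

```python
from fractions import Fraction

def binary_digits(x, n):
    """First n binary digits x_1, ..., x_n of x in [0, 1), by repeated comparison
    of the remaining difference with 1/2, 1/4, 1/8, ..."""
    if not 0 <= x < 1:
        raise ValueError("x must lie in [0, 1)")
    digits, rest, step = [], Fraction(x), Fraction(1, 2)
    for _ in range(n):
        if rest >= step:   # digit 1: subtract the current power of 1/2
            digits.append(1)
            rest -= step
        else:              # digit 0: nothing to subtract
            digits.append(0)
        step /= 2
    return digits

print(binary_digits(Fraction(5, 8), 4))
print(binary_digits(Fraction(1, 2), 5))
print(binary_digits(Fraction(1, 3), 6))
```

For $x = \frac{1}{2}$ the loop produces the representation ending in zeros, which is the one singled out by the convention above.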

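After that heuristic method we now present a mathematically more exact way. To this end, for each $n \ge 1$, we divide the interval $[0, 1)$ into $2^n$ intervals of length $2^{-n}$.

We start with $n = 1$ and divide $[0, 1)$ into the two intervals

$$I_0 := \big[0, \tfrac{1}{2}\big) \quad \text{and} \quad I_1 := \big[\tfrac{1}{2}, 1\big).$$

In the second step we divide each of the two intervals $I_0$ and $I_1$ further into two parts of equal length. In this way we obtain the four intervals

$$I_{00} := \big[0, \tfrac{1}{4}\big), \quad I_{01} := \big[\tfrac{1}{4}, \tfrac{1}{2}\big), \quad I_{10} := \big[\tfrac{1}{2}, \tfrac{3}{4}\big) \quad \text{and} \quad I_{11} := \big[\tfrac{3}{4}, 1\big).$$

In other words,

$$I_{a_1 a_2} = \Big[ \sum_{j=1}^{2} \frac{a_j}{2^j},\; \sum_{j=1}^{2} \frac{a_j}{2^j} + \frac{1}{2^2} \Big), \quad a_1, a_2 \in \{0, 1\}.$$

It is clear now how to proceed. Given $n \ge 1$ and numbers $a_1, \ldots, a_n \in \{0, 1\}$, set

$$I_{a_1 \cdots a_n} := \Big[ \sum_{j=1}^{n} \frac{a_j}{2^j},\; \sum_{j=1}^{n} \frac{a_j}{2^j} + \frac{1}{2^n} \Big). \tag{4.4}$$

In this way, we obtain $2^n$ disjoint intervals of length $2^{-n}$ where the left corner points are $0.a_1 a_2 \cdots a_n$.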
The following lemma makes the above heuristic method more precise. 


Lemma 4.3.1. For all $a_1, \ldots, a_n \in \{0, 1\}$ the intervals in definition (4.4) are characterized by

$$I_{a_1 \cdots a_n} = \{x \in [0, 1) : x = 0.a_1 a_2 \cdots a_n \cdots\}.$$

Verbally, a number in $[0, 1)$ belongs to $I_{a_1 \cdots a_n}$ if and only if its first $n$ digits in the binary fraction are $a_1, \ldots, a_n$.


Proof: Assume first $x \in I_{a_1 \cdots a_n}$. If $a := 0.a_1 \cdots a_n$ denotes the left corner point of $I_{a_1 \cdots a_n}$, then by definition $a \le x < a + 1/2^n$ or, equivalently, $0 \le x - a < 1/2^n$. Therefore, the binary fraction of $x - a$ is of the form $0.00 \cdots 0\, b_{n+1} \cdots$ with certain numbers $b_{n+1}, b_{n+2}, \ldots \in \{0, 1\}$. This yields

$$x = a + (x - a) = 0.a_1 \cdots a_n b_{n+1} \cdots.$$

Thus, as asserted, the first $n$ digits in the representation of $x$ are $a_1, \ldots, a_n$.

Conversely, if $x$ can be written as $x = 0.x_1 x_2 \cdots$ with $x_1 = a_1, \ldots, x_n = a_n$, then $a \le x$ where, as above, $a$ denotes the left corner point of $I_{a_1 \cdots a_n}$. Moreover, by Convention 4.3.1 at least one of the $x_k$s, $k > n$, has to be zero. Consequently,

$$x - a = \sum_{k=n+1}^{\infty} \frac{x_k}{2^k} < \sum_{k=n+1}^{\infty} \frac{1}{2^k} = \frac{1}{2^n},$$

that is, we have $a \le x < a + \frac{1}{2^n}$ or, equivalently, $x \in I_{a_1 \cdots a_n}$ as asserted. □


A direct consequence of Lemma 4.3.1 is as follows. 


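Corollary 4.3.2. For each $n \ge 1$ the $2^n$ sets $I_{a_1 \cdots a_n}$ form a disjoint partition of $[0, 1)$, that is,

$$\bigcup_{a_1, \ldots, a_n \in \{0, 1\}} I_{a_1 \cdots a_n} = [0, 1) \quad \text{and} \quad I_{a_1 \cdots a_n} \cap I_{a_1' \cdots a_n'} = \emptyset$$

provided that $(a_1, \ldots, a_n) \ne (a_1', \ldots, a_n')$. Furthermore,

$$\{x \in [0, 1) : x_k = 0\} = \bigcup_{a_1, \ldots, a_{k-1} \in \{0, 1\}} I_{a_1 \cdots a_{k-1} 0}.$$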
4.3.2 Binary Fractions of Random Numbers


We saw above that each number $x \in [0, 1)$ admits a representation $x = 0.x_1 x_2 \cdots$ with certain $x_k \in \{0, 1\}$. What happens if we choose a number $x$ randomly, say according to the uniform distribution on $[0, 1]$? Then the $x_k$s in the binary fraction are also random, with values in $\{0, 1\}$. How are they distributed?

The mathematical formulation of this question is as follows: let $U : \Omega \to \mathbb{R}$ be a random variable uniformly distributed on $[0, 1]$. If $\omega \in \Omega$, write³

$$U(\omega) = 0.X_1(\omega) X_2(\omega) \cdots = \sum_{k=1}^{\infty} \frac{X_k(\omega)}{2^k}. \tag{4.5}$$

In this way we obtain infinitely many random variables $X_k : \Omega \to \{0, 1\}$.
Which distribution do these random variables possess? The next proposition gives the answer.


Proposition 4.3.3. If $k \in \mathbb{N}$, then

$$P\{X_k = 0\} = P\{X_k = 1\} = \frac{1}{2}. \tag{4.6}$$

Furthermore, given $n \ge 1$, the random variables $X_1, \ldots, X_n$ are independent.


Proof: By assumption $P_U$ is the uniform distribution on $[0, 1]$. Thus, the finite additivity of $P_U$, Corollary 4.3.2 and eq. (1.45) imply

$$P\{X_k = 0\} = P_U\Big( \bigcup_{a_1, \ldots, a_{k-1} \in \{0, 1\}} I_{a_1 \cdots a_{k-1} 0} \Big) = \sum_{a_1, \ldots, a_{k-1} \in \{0, 1\}} P_U\big(I_{a_1 \cdots a_{k-1} 0}\big) = \sum_{a_1, \ldots, a_{k-1} \in \{0, 1\}} \frac{1}{2^k} = \frac{2^{k-1}}{2^k} = \frac{1}{2}.$$

Since $X_k$ attains only two different values, $P\{X_k = 1\} = 1/2$ as well, proving the first part.

We want to verify that for all $n \ge 1$ the random variables $X_1, \ldots, X_n$ are independent. Equivalently, according to Proposition 3.6.16, the following has to be proven: if $a_1, \ldots, a_n \in \{0, 1\}$, then

$$P\{X_1 = a_1, \ldots, X_n = a_n\} = P\{X_1 = a_1\} \cdots P\{X_n = a_n\}. \tag{4.7}$$

By eq. (4.6) the right-hand side of eq. (4.7) equals

$$P\{X_1 = a_1\} \cdots P\{X_n = a_n\} = \underbrace{\frac{1}{2} \cdots \frac{1}{2}}_{n} = \frac{1}{2^n}.$$


3 Note that $P\{U \in [0, 1)\} = 1$. Thus, without losing generality, we may assume $U(\omega) \in [0, 1)$.

To compute the left-hand side of eq. (4.7), note that Lemma 4.3.1 implies that we have $X_1 = a_1$ up to $X_n = a_n$ if and only if $U$ attains a value in $I_{a_1 \cdots a_n}$. The intervals $I_{a_1 \cdots a_n}$ are of length $2^{-n}$, hence by eq. (1.45) (recall that $P_U$ is the uniform distribution on $[0, 1]$),

$$P\{X_1 = a_1, \ldots, X_n = a_n\} = P\{U \in I_{a_1 \cdots a_n}\} = P_U(I_{a_1 \cdots a_n}) = \frac{1}{2^n}.$$

Thus, for all $a_1, \ldots, a_n \in \{0, 1\}$ eq. (4.7) is valid, and, as asserted, the random variables $X_1, \ldots, X_n$ are independent. □
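The counting argument in the proof can be replayed with exact arithmetic. The sketch below is an illustration only, with function names chosen here: it builds the intervals from definition (4.4) and adds up the lengths of all intervals whose $k$th digit is 0.

```python
from fractions import Fraction
from itertools import product

def dyadic_interval(digits):
    """End points of I_{a_1...a_n} = [left, left + 2^(-n)) from definition (4.4)."""
    n = len(digits)
    left = sum(Fraction(a, 2**j) for j, a in enumerate(digits, start=1))
    return left, left + Fraction(1, 2**n)

def prob_digit_zero(k):
    """P{X_k = 0} as the total length of all intervals I_{a_1...a_(k-1)0}."""
    total = Fraction(0)
    for prefix in product((0, 1), repeat=k - 1):
        left, right = dyadic_interval(prefix + (0,))
        total += right - left
    return total

print(dyadic_interval((1, 0, 1)))
print([prob_digit_zero(k) for k in range(1, 6)])
```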


To formulate the previous result in a different way, let us introduce the following 
notation. 


Definition 4.3.4. An infinite sequence $X_1, X_2, \ldots$ of random variables is said to be independent provided that any finite collection of the $X_k$s is independent.


Remark 4.3.5. Since any subcollection of independent random variables is independent as well, the independence of $X_1, X_2, \ldots$ is equivalent to the following. For all $n \ge 1$ the random variables $X_1, \ldots, X_n$ are independent, that is, for all $n \ge 1$ and all Borel sets $B_1, \ldots, B_n$ it follows that

$$P\{X_1 \in B_1, \ldots, X_n \in B_n\} = P\{X_1 \in B_1\} \cdots P\{X_n \in B_n\}.$$


Remark 4.3.6. In view of Definition 4.3.4 the basic observation in Example 3.6.17 may now be formulated in the following way. If we toss a (maybe biased) coin, labeled with “0” and “1,” infinitely often and if we let $X_1, X_2, \ldots$ be the results of the single tosses, then this infinite sequence of random variables is independent with $P\{X_k = 0\} = 1 - p$ and $P\{X_k = 1\} = p$. In particular, for a fair coin the $X_k$s possess the following properties:

(a) If $k \in \mathbb{N}$, then $P\{X_k = 0\} = P\{X_k = 1\} = 1/2$.

(b) $X_1, X_2, \ldots$ is an infinite sequence of independent random variables.


This observation leads us to the following definition. 


Definition 4.3.7. An infinite sequence $X_1, X_2, \ldots$ of independent random variables with values in $\{0, 1\}$ satisfying

$$P\{X_k = 0\} = P\{X_k = 1\} = 1/2, \quad k = 1, 2, \ldots$$

is said to be a model for tossing a fair coin infinitely often.


Consequently, Proposition 4.3.3 asserts that the random variables $X_1, X_2, \ldots$ defined by eq. (4.5) serve as a model for tossing a fair coin infinitely often.
4.3.3 Random Numbers Generated by Coin Tossing


We saw in Proposition 4.3.3 that choosing a random number in $[0, 1]$ leads to a model for tossing a fair coin infinitely often. Our aim is now to investigate the converse question. That is, we are given an infinite random sequence of zeros and ones and we want to construct a uniformly distributed number in $[0, 1]$. The precise mathematical question is as follows: suppose we are given an infinite sequence $(X_k)_{k \ge 1}$ of independent random variables with

$$P\{X_k = 0\} = P\{X_k = 1\} = 1/2, \quad k = 1, 2, \ldots. \tag{4.8}$$

Is it possible to construct from these $X_k$s a uniformly distributed $U$? The next proposition answers this question in the affirmative.


Proposition 4.3.8. Let $X_1, X_2, \ldots$ be an arbitrary sequence of independent random variables satisfying eq. (4.8). If $U$ is defined by

$$U(\omega) := \sum_{k=1}^{\infty} \frac{X_k(\omega)}{2^k}, \quad \omega \in \Omega,$$

then this random variable is uniformly distributed on $[0, 1]$.


Proof: In order to prove that $U$ is uniformly distributed on $[0, 1]$, we have to show that, if $t \in [0, 1)$, then

$$P\{U \le t\} = t. \tag{4.9}$$

We start the proof of eq. (4.9) with the following observation: suppose the binary fraction of some $t \in [0, 1)$ is $0.t_1 t_2 \cdots$ for certain $t_i \in \{0, 1\}$. If $s = 0.s_1 s_2 \cdots$, then $s < t$ if and only if there is an $n \in \mathbb{N}$ so that the following is satisfied⁴:

$$s_1 = t_1, \ldots, s_{n-1} = t_{n-1}, \quad s_n = 0 \quad \text{and} \quad t_n = 1.$$

Fix $t \in [0, 1)$ for a moment and set

$$A_n(t) := \{s \in [0, 1) : s_1 = t_1, \ldots, s_{n-1} = t_{n-1},\; s_n < t_n\}.$$


4 In the case $n = 1$ this says $s_1 = 0$ and $t_1 = 1$.

Of course, $A_n(t) \cap A_m(t) = \emptyset$ whenever $n \ne m$ and, moreover, $A_n(t) \ne \emptyset$ if and only if $t_n = 1$. Furthermore, by the previous remark

$$[0, t) = \bigcup_{n=1}^{\infty} A_n(t) = \bigcup_{\{n : t_n = 1\}} A_n(t).$$

Finally, if $A_n(t) \ne \emptyset$, that is, if $t_n = 1$, then

$$P\{U \in A_n(t)\} = P\{X_1 = t_1, \ldots, X_{n-1} = t_{n-1}, X_n = 0\} = P\{X_1 = t_1\} \cdots P\{X_{n-1} = t_{n-1}\} \cdot P\{X_n = 0\} = \frac{1}{2^n}.$$

In the last step we used both properties of the $X_k$s, that is, they are independent and satisfy $P\{X_k = 0\} = P\{X_k = 1\} = 1/2$.
Summing up, we get

$$P\{U < t\} = P\Big\{U \in \bigcup_{\{n : t_n = 1\}} A_n(t)\Big\} = \sum_{\{n : t_n = 1\}} P\{U \in A_n(t)\} = \sum_{\{n : t_n = 1\}} \frac{1}{2^n} = \sum_{n=1}^{\infty} \frac{t_n}{2^n} = t.$$

This “almost” proves eq. (4.9). It remains to show that $P\{U \le t\} = P\{U < t\}$ or, equivalently, $P\{U = t\} = 0$. To verify this we use the continuity of $P$ from above. Then

$$P\{U = t\} = P\{X_1 = t_1, X_2 = t_2, \ldots\} = \lim_{n \to \infty} P\{X_1 = t_1, \ldots, X_n = t_n\} = \lim_{n \to \infty} \frac{1}{2^n} = 0.$$

Consequently, eq. (4.9) holds for all $t \in [0, 1)$ and, as asserted, $U$ is uniformly distributed on $[0, 1]$. □


Remark 4.3.9. Another possibility to write $U$ is as binary fraction

$$U(\omega) = 0.X_1(\omega) X_2(\omega) \cdots, \quad \omega \in \Omega.$$

Consequently, in order to construct a random number $u$ in $[0, 1]$ one may proceed as follows: toss a fair coin with “0” and “1” infinitely often and take the obtained sequence as binary fraction of $u$. The $u$ obtained in this way is uniformly distributed on $[0, 1]$.

Of course, in practice one does not toss a coin infinitely often. One stops the procedure after $N$ trials for some “large” $N$. In this way one gets a number $u$, which is “almost” uniformly distributed on $[0, 1]$.




Then how does one construct $n$ independent numbers $u_1, \ldots, u_n$, all uniformly distributed on $[0, 1]$? The answer is quite obvious. Take $n$ coins and toss them. As functions of independent observations the generated $u_1, \ldots, u_n$ are independent as well and, by the construction, each of these numbers is uniformly distributed on $[0, 1]$. Another way is to toss the same coin $n$ times “infinitely often,” thus getting $n$ infinite sequences of zeroes and ones.
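The truncated construction, stopping after $N$ tosses, may be sketched as follows. This is an illustration under assumptions made here: the coin is simulated by Python's pseudo-random generator (`random.getrandbits(1)`), the function names are hypothetical, and $N = 53$ is chosen only because a double-precision number holds that many binary digits.

```python
import random

def bits_to_uniform(bits):
    """Read the tosses b_1, ..., b_N as the binary fraction 0.b_1 b_2 ... b_N."""
    return sum(b / 2**k for k, b in enumerate(bits, start=1))

def uniform_from_coin(n_tosses=53, rng=random):
    """Toss a (simulated) fair coin n_tosses times and return the binary fraction."""
    bits = [rng.getrandbits(1) for _ in range(n_tosses)]
    return bits_to_uniform(bits)

rng = random.Random(2024)  # fixed seed, only to make the run reproducible
sample = [uniform_from_coin(rng=rng) for _ in range(5)]
print(sample)
```

Each call uses fresh tosses, which corresponds to tossing the same coin several times “infinitely often” in order to obtain several independent numbers.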


4.4 Simulation of Random Variables 


Proposition 4.3.8 provides us with a technique to simulate a uniformly distributed 
random variable U by tossing a fair coin. The aim of this section is to find a suitable 
function f : [0,1] > R, so that the transformed random variable X = f(U) possesses a 
given probability distribution. 


Example 4.4.1. Typical questions of this kind are as follows: find a function f so 
that X = f(U) is standard normally distributed. Does there exist another function 
g: [0,1] > R for which g(U) is By,p-distributed? 


Suppose for a moment we already found such functions f and g. According to
Remark 4.3.9, we construct independent numbers $u_1, \ldots, u_n$, uniformly distributed on
[0, 1], and set $x_i = f(u_i)$ and $y_i = g(u_i)$. In this way we get either n standard normally
distributed numbers $x_1, \ldots, x_n$ or n binomial distributed numbers $y_1, \ldots, y_n$. Moreover,
by Proposition 4.1.9 these numbers are independent. In this way we may simulate
independent random numbers possessing a given probability distribution.

We start with simulating discrete random variables. Thus suppose we are given
real numbers $x_1, x_2, \ldots$ and $p_k \ge 0$ with $\sum_{k=1}^{\infty} p_k = 1$, and we look for a random variable
X = f(U) such that

$$P\{X = x_k\} = p_k, \quad k = 1, 2, \ldots.$$

One possible way to find such a function f is as follows: divide [0, 1) into disjoint
intervals $I_1, I_2, \ldots$ of length $|I_k| = p_k$, $k = 1, 2, \ldots$. Since $\sum_{k=1}^{\infty} p_k = 1$, such intervals exist.
For example, take $I_1 = [0, p_1)$ and

$$I_k = \Big[\sum_{i=1}^{k-1} p_i, \; \sum_{i=1}^{k} p_i\Big), \quad k = 2, 3, \ldots.$$

With these intervals $I_k$ we define $f : [0,1] \to \mathbb{R}$ by

$$f(x) := x_k \quad \text{if } x \in I_k, \tag{4.10}$$

or, equivalently,

$$f(x) := \sum_{k=1}^{\infty} x_k \mathbf{1}_{I_k}(x). \tag{4.11}$$


Then the following is true. 


Proposition 4.4.2. Let U be uniformly distributed on [0,1], and set X = f(U) with f
defined by eq. (4.10) or eq. (4.11). Then

$$P\{X = x_k\} = p_k, \quad k = 1, 2, \ldots$$

Proof: Using that U is uniformly distributed on [0, 1], this is an easy consequence of
eq. (1.45) in view of

$$P\{X = x_k\} = P\{f(U) = x_k\} = P\{U \in I_k\} = |I_k| = p_k.$$
□
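
The interval construction behind this proposition can be sketched in Python for a finite list of values. This is a minimal illustration, not a definitive implementation: the helper name `make_discrete_sampler` and the sample values are hypothetical, and the right end points of the intervals are found by binary search with `bisect_right`.

```python
import random
from bisect import bisect_right
from itertools import accumulate

def make_discrete_sampler(values, probs):
    """Return f with f(u) = values[k] whenever u lies in the k-th interval."""
    right_ends = list(accumulate(probs))    # p_1, p_1 + p_2, ..., (about) 1
    def f(u):
        k = bisect_right(right_ends, u)     # number of right end points <= u
        return values[min(k, len(values) - 1)]   # guard against rounding near 1
    return f

values = ["a", "b", "c"]                    # hypothetical values x_1, x_2, x_3
probs = [0.2, 0.5, 0.3]                     # hypothetical weights p_1, p_2, p_3
f = make_discrete_sampler(values, probs)

rng = random.Random(1)
n = 100_000
counts = {v: 0 for v in values}
for _ in range(n):
    counts[f(rng.random())] += 1
freqs = {v: counts[v] / n for v in values}  # relative frequencies, close to probs
```

The binary search is a design choice for speed; a linear scan over the intervals would implement the same function f.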


Remark 4.4.3. Note that the concrete shape of the intervals⁵ $I_k$ is not important at all.
They only have to satisfy $|I_k| = p_k$, $k = 1, 2, \ldots$. Moreover, these intervals need not
be disjoint; a “small” overlap does not influence the assertion. Indeed,
it suffices that $P\{U \in I_k \cap I_l\} = 0$ whenever $k \ne l$. For example, if always $\#(I_k \cap I_l) < \infty$,
$k \ne l$, then the construction works as well. In particular, we may also choose

$$I_k = \Big[\sum_{i=1}^{k-1} p_i, \; \sum_{i=1}^{k} p_i\Big].$$


Example 4.4.4. We want to simulate a random variable X, which is uniformly distributed
on $\{x_1, \ldots, x_N\}$. How to proceed?

Answer: Divide the interval [0, 1) into N intervals $I_1, \ldots, I_N$ of length $1/N$. For
example, choose $I_k := \big[\frac{k-1}{N}, \frac{k}{N}\big)$, $k = 1, \ldots, N$. If $f = \sum_{k=1}^{N} x_k \mathbf{1}_{I_k}$, then X = f(U) is uniformly
distributed on $\{x_1, \ldots, x_N\}$.


Example 4.4.5. Suppose we want to simulate a number $k \in \mathbb{N}_0$, which is
$\mathrm{Pois}_\lambda$-distributed. Set

$$I_k := \Big[\sum_{j=0}^{k-1} \frac{\lambda^j}{j!} e^{-\lambda}, \; \sum_{j=0}^{k} \frac{\lambda^j}{j!} e^{-\lambda}\Big), \quad k = 0, 1, \ldots,$$

where the left-hand sum is supposed to be zero if k = 0. Choose randomly a number
$u \in [0, 1]$ and take the k with $u \in I_k$. Then k is the number we are interested in.
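
A possible Python sketch of this search for the interval containing u is given below. The function name `poisson_from_uniform` is hypothetical; the code walks through the right end points of the intervals one by one, which is a simple choice rather than an optimized one.

```python
import math
import random

def poisson_from_uniform(u, lam):
    """Return the k for which u lies in the k-th interval of the example."""
    k = 0
    term = math.exp(-lam)        # lam^0 / 0! * exp(-lam)
    right_end = term             # right end point of the interval for k = 0
    while u >= right_end and term > 0.0:   # term > 0 guards against rounding
        k += 1
        term *= lam / k          # lam^k / k! * exp(-lam) from the previous term
        right_end += term
    return k

rng = random.Random(7)
lam = 3.0
sample = [poisson_from_uniform(rng.random(), lam) for _ in range(50_000)]
sample_mean = sum(sample) / len(sample)    # should be close to lam
```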


5 They do not even need to be intervals.

Our next aim is to simulate continuous random variables. More precisely, suppose we
are given a probability density p. Then we look for a function $f : [0,1] \to \mathbb{R}$ such that p
is the density of X = f(U), that is,

$$P\{X \le t\} = \int_{-\infty}^{t} p(x)\,dx, \quad t \in \mathbb{R}. \tag{4.12}$$

To this end set

$$F(t) := \int_{-\infty}^{t} p(x)\,dx, \quad t \in \mathbb{R}. \tag{4.13}$$

Thus, F is the distribution function of the random variable X, which we are going to
construct.

Suppose first that F is one-to-one on a finite or infinite interval (a, b), so that
F(x) = 0 if x < a, and F(x) = 1 if x > b. Since F is continuous, the inverse function
$F^{-1}$ exists and maps (0, 1) to (a, b).


Proposition 4.4.6. Let p be a probability density and define F by eq. (4.13). Suppose F
satisfies the above condition. If $X = F^{-1}(U)$, then

$$P\{X \le t\} = \int_{-\infty}^{t} p(x)\,dx, \quad t \in \mathbb{R},$$

that is, p is a density of X.

Proof: First note that the assumptions about F imply that F is increasing on (a, b).
Hence, if $t \in \mathbb{R}$, then

$$P\{X \le t\} = P\{F^{-1}(U) \le t\} = P\{U \le F(t)\} = F(t) = \int_{-\infty}^{t} p(x)\,dx.$$

Here we used $0 \le F(t) \le 1$ and $P\{U \le s\} = s$ whenever $0 \le s \le 1$. This completes the
proof.
□


But what do we do if F does not satisfy the above assumption? For example, this
happens if p(x) = 0 on an interval $I = (\alpha, \beta)$ and p(x) > 0 on some left- and right-hand
intervals⁶ of I. In this case $F^{-1}$ does not exist, and we have to modify the construction.⁷

6 Take, for instance, p with $p(x) = \frac{1}{2}$ if $x \in [0, 1]$ and if $x \in [2, 3]$, and p(x) = 0 otherwise.

7 All subsequent distribution functions F possess an inverse function on a suitable interval (a, b).
Thus, Proposition 4.4.6 applies in almost all cases of interest. Therefore, readers who find the
statements about pseudo-inverse functions too complicated may skip them.

Definition 4.4.7. Let F be defined by eq. (4.13). Then we set

$$F^{-}(s) = \inf\{t \in \mathbb{R} : F(t) = s\}, \quad 0 \le s < 1.$$

The function $F^{-}$, mapping [0, 1) to $[-\infty, \infty)$, is called the pseudo-inverse of F.

Remark 4.4.8. If 0 < s < 1, then $F^{-}(s) \in \mathbb{R}$ while $F^{-}(0) = -\infty$. Moreover, whenever F is
increasing on some interval I, then $F^{-}(s) = F^{-1}(s)$ for $s \in I$.


Lemma 4.4.9. The pseudo-inverse function $F^{-}$ possesses the following properties.
1. If $s \in (0,1)$ and $t \in \mathbb{R}$, then

$$F(F^{-}(s)) = s \quad \text{and} \quad F^{-}(F(t)) \le t.$$

2. Given $s \in (0,1)$ and $t \in \mathbb{R}$ we have

$$F^{-}(s) \le t \iff s \le F(t). \tag{4.14}$$

Proof: The equality $F(F^{-}(s)) = s$ is a direct consequence of the continuity of F. Indeed,
if there are $t_n \searrow F^{-}(s)$ with $F(t_n) = s$, then

$$s = \lim_{n\to\infty} F(t_n) = F(F^{-}(s)).$$

The second part of the first assertion follows by the definition of $F^{-}$.

Now let us come to the proof of property (4.14). If $F^{-}(s) \le t$, then the monotonicity
of F as well as $F(F^{-}(s)) = s$ lead to $s = F(F^{-}(s)) \le F(t)$.

Conversely, if $s \le F(t)$, then $F^{-}(s) \le F^{-}(F(t)) \le t$ by the first part, thus,
property (4.14) is proved.
□


Now choose a uniformly distributed U and set $X = F^{-}(U)$. Since P{U = 0} = 0, we may
assume that X attains values in $\mathbb{R}$.

Proposition 4.4.10. Let p be a probability density, that is, we have $p(x) \ge 0$ and
$\int_{-\infty}^{\infty} p(x)\,dx = 1$. Define F by eq. (4.13) and let $F^{-}$ be its pseudo-inverse. Take U uniform
on [0, 1] and set $X = F^{-}(U)$. Then p is a distribution density of the random variable X.

Proof: Using property (4.14) it follows

$$F_X(t) = P\{X \le t\} = P\{\omega \in \Omega : F^{-}(U(\omega)) \le t\} = P\{\omega \in \Omega : U(\omega) \le F(t)\} = F(t),$$

which completes the proof.
□

Remark 4.4.11. Since $F^{-} = F^{-1}$ whenever the inverse function exists, Proposition 4.4.6
is a special case of Proposition 4.4.10.


Example 4.4.12. Let us simulate an $\mathcal{N}(0, 1)$-distributed random variable, that is, we
are looking for a function $f : (0,1) \to \mathbb{R}$ such that for uniformly distributed U

$$P\{f(U) \le t\} = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{t} e^{-x^2/2}\,dx, \quad t \in \mathbb{R}.$$

The distribution function

$$\Phi(t) = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{t} e^{-x^2/2}\,dx$$

is one-to-one from $\mathbb{R}$ to (0, 1), hence Proposition 4.4.6 applies, and $\Phi^{-1}(U)$ is a standard
normal random variable.

How does one get an $\mathcal{N}(\mu, \sigma^2)$-distributed random variable? If X is standard
normal, by Proposition 4.2.3 the transformed variable $\sigma X + \mu$ is $\mathcal{N}(\mu, \sigma^2)$-distributed.
Consequently, $\sigma \Phi^{-1}(U) + \mu$ possesses the desired distribution.

How do we find n independent $\mathcal{N}(\mu, \sigma^2)$-distributed numbers $x_1, \ldots, x_n$? To
achieve this, choose $u_1, \ldots, u_n$ in [0, 1] according to the construction presented in
Remark 4.3.9 and set $x_i = \sigma \Phi^{-1}(u_i) + \mu$, $1 \le i \le n$.
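
The recipe of this example can be tried out in Python, where the standard library class `statistics.NormalDist` offers the inverse of the standard normal distribution function as `inv_cdf`. The sketch below is an illustration under assumptions: the names `normal_from_uniform` and `uniform_open` are hypothetical, and a pseudo-random generator replaces the coin tossing of Remark 4.3.9.

```python
import random
from statistics import NormalDist, fmean, pstdev

STANDARD_NORMAL = NormalDist()   # mean 0, standard deviation 1

def normal_from_uniform(u, mu=0.0, sigma=1.0):
    """Return sigma * Phi^{-1}(u) + mu; requires 0 < u < 1."""
    return sigma * STANDARD_NORMAL.inv_cdf(u) + mu

def uniform_open(rng):
    """Uniform number in the open interval (0, 1); redraw the rare value 0.0."""
    u = rng.random()
    while u == 0.0:
        u = rng.random()
    return u

rng = random.Random(11)
mu, sigma = 2.0, 3.0
sample = [normal_from_uniform(uniform_open(rng), mu, sigma) for _ in range(20_000)]
sample_mean, sample_sd = fmean(sample), pstdev(sample)   # close to mu and sigma
```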


Example 4.4.13. Our next aim is to simulate an $E_\lambda$-distributed (exponentially
distributed) random variable. Here

$$F(t) = \begin{cases} 0 & : t \le 0 \\ 1 - e^{-\lambda t} & : t > 0, \end{cases}$$

which satisfies the assumptions of Proposition 4.4.6 on the interval $(0, \infty)$. Its inverse
$F^{-1}$ maps (0, 1) to $(0, \infty)$ and equals

$$F^{-1}(s) = -\frac{\ln(1 - s)}{\lambda}, \quad 0 < s < 1.$$

Therefore, if U is uniformly distributed on [0, 1], then $X = -\frac{\ln(1 - U)}{\lambda}$ is $E_\lambda$-distributed.
This is true for any uniformly distributed random variable U. By Corollary 4.2.7 the
random variable $1 - U$ has the same distribution as U, hence, setting

$$Y := -\frac{\ln(1 - (1 - U))}{\lambda} = -\frac{\ln(U)}{\lambda},$$

Y is $E_\lambda$-distributed as well.

Example 4.4.14. Let us simulate a random variable with Cauchy distribution
(cf. Definition 1.6.33). The distribution function F is given by

$$F(t) = \frac{1}{\pi} \int_{-\infty}^{t} \frac{1}{1 + x^2}\,dx = \frac{1}{\pi} \arctan(t) + \frac{1}{2}, \quad t \in \mathbb{R},$$

hence $X := \tan\big(\pi U - \frac{\pi}{2}\big)$ possesses a Cauchy distribution.
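
Both inverse functions, the one for the exponential and the one for the Cauchy distribution, are short enough to be written down directly in Python. The following sketch uses hypothetical function names and a pseudo-random generator in place of U; it checks the samples against two known facts, namely that the mean of an exponential distribution with parameter λ is 1/λ and that a Cauchy variable lies in [-1, 1] with probability 1/2.

```python
import math
import random

def exponential_from_uniform(u, lam):
    """X = -ln(1 - u) / lam for 0 <= u < 1."""
    return -math.log(1.0 - u) / lam

def cauchy_from_uniform(u):
    """X = tan(pi * u - pi / 2) for 0 < u < 1."""
    return math.tan(math.pi * u - math.pi / 2.0)

rng = random.Random(5)
us = [rng.random() for _ in range(50_000)]
exp_sample = [exponential_from_uniform(u, 2.0) for u in us]
cauchy_sample = [cauchy_from_uniform(u) for u in us]

exp_mean = sum(exp_sample) / len(exp_sample)          # close to 1 / 2.0
share_in_unit = sum(-1.0 <= x <= 1.0 for x in cauchy_sample) / len(cauchy_sample)
```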
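Example 4.4.15. Finally, let us give an example where Proposition 4.4.10 applies and
Proposition 4.4.6 does not. Suppose we want to simulate a random variable X with
distribution function F defined by

$$F(t) = \begin{cases} 0 & : t < 0 \\ \frac{t}{2} & : 0 \le t < 1 \\ \frac{1}{2} & : 1 \le t < 2 \\ \frac{t-1}{2} & : 2 \le t \le 3 \\ 1 & : t > 3. \end{cases} \tag{4.15}$$

Direct computations imply

$$F^{-}(s) = \begin{cases} 2s & : 0 < s < \frac{1}{2} \\ 1 & : s = \frac{1}{2} \\ 2s + 1 & : \frac{1}{2} < s < 1, \end{cases}$$

hence, if X is defined by

$$X = 2U\,\mathbf{1}_{\{U \le 1/2\}} + (2U + 1)\,\mathbf{1}_{\{U > 1/2\}},$$

then $P\{X \le t\} = F(t)$ with F defined by eq. (4.15). In other words, X is acting as follows.
Choose at random a number $u \in [0, 1]$. If $u \le \frac{1}{2}$, then X(u) = 2u, while for $u > \frac{1}{2}$ we take
X(u) = 2u + 1.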
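
A small Python sketch may help to see how the pseudo-inverse skips the gap. It assumes that F belongs to the density which equals 1/2 on [0, 1] and on [2, 3], which is what the rule X(u) = 2u or X(u) = 2u + 1 corresponds to; the function names are hypothetical.

```python
import random

def F(t):
    """Distribution function of the density 1/2 on [0, 1] and on [2, 3]."""
    if t < 0:
        return 0.0
    if t < 1:
        return t / 2
    if t < 2:
        return 0.5
    if t < 3:
        return (t - 1) / 2
    return 1.0

def F_pseudo_inverse(s):
    """Pseudo-inverse of F for 0 < s < 1, following the example."""
    if not 0.0 < s < 1.0:
        raise ValueError("s must lie in (0, 1)")
    return 2 * s if s <= 0.5 else 2 * s + 1

rng = random.Random(3)
us = [u for u in (rng.random() for _ in range(20_000)) if u > 0.0]
sample = [F_pseudo_inverse(u) for u in us]
share_low = sum(x <= 1.0 for x in sample) / len(sample)   # close to 1/2
```

No simulated value falls into the interval (1, 2], where the density vanishes.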


4.5 Addition of Random Variables 


Suppose we are given two random variables X and Y, both mapping from $\Omega$ into $\mathbb{R}$. As
usual, their sum X + Y is defined by

$$(X + Y)(\omega) := X(\omega) + Y(\omega), \quad \omega \in \Omega.$$

The main question we investigate in this section is as follows: suppose we know the
probability distributions of X and Y. Is there a way to compute the distribution of X + Y?
For example, if we roll a die twice, X is the result of the first roll, Y that of the second,
then we know $P_X$ and $P_Y$. But how do we get $P_{X+Y}$?

Before we treat this question, we have to be sure that X + Y is also a random
variable. This is not obvious at all. Otherwise, the probability distribution of X + Y is not
defined and our question does not make sense.


Proposition 4.5.1. If X and Y are random variables, then so is X + Y. 


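Proof: We start the proof with the following observation. For two real numbers a and
b we have a < b if and only if there is a rational number $q \in \mathbb{Q}$ such that a < q and b > q.
Therefore, given $t \in \mathbb{R}$, it follows that

$$\{\omega \in \Omega : X(\omega) + Y(\omega) < t\} = \{\omega \in \Omega : X(\omega) < t - Y(\omega)\} = \bigcup_{q \in \mathbb{Q}} \Big[\{\omega : X(\omega) < q\} \cap \{\omega : q < t - Y(\omega)\}\Big]. \tag{4.16}$$

By assumption, X and Y are random variables. Hence, for each $q \in \mathbb{Q}$,

$$A_q := \{\omega : X(\omega) < q\} \in \mathcal{A} \quad \text{and} \quad B_q := \{\omega : Y(\omega) < t - q\} \in \mathcal{A},$$

which by the properties of σ-fields implies $C_q := A_q \cap B_q \in \mathcal{A}$. With this notation we
may write eq. (4.16) as

$$\{\omega \in \Omega : X(\omega) + Y(\omega) < t\} = \bigcup_{q \in \mathbb{Q}} C_q.$$

The σ-field $\mathcal{A}$ is closed under countable unions, thus, since $\mathbb{Q}$ is countably infinite and
$C_q \in \mathcal{A}$, it follows that $\bigcup_{q \in \mathbb{Q}} C_q \in \mathcal{A}$. Therefore, we have proven that, if $t \in \mathbb{R}$, then

$$\{\omega \in \Omega : X(\omega) + Y(\omega) < t\} \in \mathcal{A}.$$

Proposition 3.1.6 lets us conclude that, as asserted, X + Y is a random variable.
□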


Remark 4.5.2. In view of Proposition 4.5.1 the following question makes sense: does
there exist a general approach to evaluate $P_{X+Y}$ by virtue of $P_X$ and of $P_Y$?

Answer: Such a general way does not exist. The deeper reason behind this is that,
in order to get $P_{X+Y}$, one has to know the joint distribution of (X, Y). And as we saw in
Section 3.5, in general, the knowledge of $P_X$ and $P_Y$ does not suffice to determine their
joint distribution, hence generally we also do not know $P_{X+Y}$.

The next example emphasizes the previous remark.

Example 4.5.3. Let $X$, $Y$, $X'$, and $Y'$ be as in Example 3.5.8, that is,

$$P\{X = 0, Y = 0\} = \frac{1}{6}, \quad P\{X = 0, Y = 1\} = \frac{1}{3},$$
$$P\{X = 1, Y = 0\} = \frac{1}{3}, \quad P\{X = 1, Y = 1\} = \frac{1}{6},$$

and

$$P\{X' = 0, Y' = 0\} = P\{X' = 0, Y' = 1\} = P\{X' = 1, Y' = 0\} = P\{X' = 1, Y' = 1\} = \frac{1}{4}.$$

Then $P_X = P_{X'}$ and $P_Y = P_{Y'}$, but

$$P\{X + Y = 0\} = \frac{1}{6}, \quad P\{X + Y = 1\} = \frac{2}{3} \quad \text{and} \quad P\{X + Y = 2\} = \frac{1}{6},$$
$$P\{X' + Y' = 0\} = \frac{1}{4}, \quad P\{X' + Y' = 1\} = \frac{1}{2} \quad \text{and} \quad P\{X' + Y' = 2\} = \frac{1}{4}.$$

Thus, $X$ and $X'$ as well as $Y$ and $Y'$ are identically distributed, but the sums $X + Y$ and
$X' + Y'$ are not.
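
The effect described in this example can be reproduced with exact fractions in Python. The joint probabilities below are an assumption made for the illustration (1/6 and 1/3 for the pair (X, Y), and 1/4 throughout for the second pair); the helper names are hypothetical.

```python
from collections import defaultdict
from fractions import Fraction as Fr

def marginal(joint, axis):
    """Marginal distribution of the first (axis=0) or second (axis=1) coordinate."""
    out = defaultdict(Fr)
    for pair, prob in joint.items():
        out[pair[axis]] += prob
    return dict(out)

def distribution_of_sum(joint):
    """Distribution of the sum of both coordinates."""
    out = defaultdict(Fr)
    for (x, y), prob in joint.items():
        out[x + y] += prob
    return dict(out)

# assumed joint distributions, see the lead-in
joint_xy = {(0, 0): Fr(1, 6), (0, 1): Fr(1, 3), (1, 0): Fr(1, 3), (1, 1): Fr(1, 6)}
joint_prime = {pair: Fr(1, 4) for pair in joint_xy}

same_marginals = all(
    marginal(joint_xy, axis) == marginal(joint_prime, axis) for axis in (0, 1)
)
sum_xy = distribution_of_sum(joint_xy)
sum_prime = distribution_of_sum(joint_prime)
```

Here `same_marginals` is true although `sum_xy` and `sum_prime` differ.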


On the other hand, as we saw in Proposition 3.6.5, the joint distribution is uniquely
determined by the marginal ones, provided the random variables are independent.
Therefore, for independent random variables X and Y, the distribution of X + Y is
determined by those of X and Y. The question remains how $P_{X+Y}$ can be computed.


4.5.1 Sums of Discrete Random Variables 


We first consider an important special case, namely that X and Y attain values in $\mathbb{Z}$.
Here we have

Proposition 4.5.4 (Convolution formula for $\mathbb{Z}$-valued random variables). Let X and Y
be two independent random variables with values in $\mathbb{Z}$. If $k \in \mathbb{Z}$, then

$$P\{X + Y = k\} = \sum_{i=-\infty}^{\infty} P\{X = i\} \cdot P\{Y = k - i\}.$$

Proof: Fix $k \in \mathbb{Z}$ and define $B_k \subseteq \mathbb{Z} \times \mathbb{Z}$ by

$$B_k := \{(i, j) \in \mathbb{Z} \times \mathbb{Z} : i + j = k\}.$$

Then we get

$$P\{X + Y = k\} = P\{(X, Y) \in B_k\} = P_{(X,Y)}(B_k) \tag{4.17}$$

with joint distribution $P_{(X,Y)}$. Proposition 3.6.9 asserts that for independent X and Y
and $B \subseteq \mathbb{Z} \times \mathbb{Z}$,

$$P_{(X,Y)}(B) = \sum_{(i,j) \in B} P_X(\{i\}) \cdot P_Y(\{j\}) = \sum_{(i,j) \in B} P\{X = i\} \cdot P\{Y = j\}.$$

We apply this formula with $B = B_k$, and by eq. (4.17) we obtain

$$P\{X + Y = k\} = \sum_{(i,j) \in B_k} P\{X = i\} \cdot P\{Y = j\} = \sum_{\{(i,j) \,:\, i + j = k\}} P\{X = i\} \cdot P\{Y = j\} = \sum_{i=-\infty}^{\infty} P\{X = i\} \cdot P\{Y = k - i\},$$

as asserted.
□


Example 4.5.5. Two independent random variables X and Y are distributed according
to $P\{X = j\} = P\{Y = j\} = 1/2^j$, $j = 1, 2, \ldots$. Determine the probability distribution of
$X - Y$.

Solution: First note that P{X = j} = P{Y = j} = 0 for $j \le 0$. Hence, given $k \in \mathbb{Z}$, an
application of Proposition 4.5.4 to X and $-Y$ yields

$$P\{X - Y = k\} = \sum_{i=-\infty}^{\infty} P\{X = i\} \cdot P\{-Y = k - i\} = \sum_{i=1}^{\infty} P\{X = i\} \cdot P\{Y = i - k\}.$$

If $k \ge 0$, then $P\{Y = i - k\} = 0$ for $i \le k$, thus

$$P\{X - Y = k\} = \sum_{i=k+1}^{\infty} P\{X = i\} \cdot P\{Y = i - k\} = \sum_{i=k+1}^{\infty} \frac{1}{2^i} \cdot \frac{1}{2^{i-k}} = 2^k \sum_{i=k+1}^{\infty} 2^{-2i} = 2^k\, 2^{-2k-2} \sum_{i=0}^{\infty} 2^{-2i} = 2^{-k-2} \cdot \frac{4}{3} = \frac{2^{-k}}{3}.$$

For k < 0 it follows that

$$P\{X - Y = k\} = \sum_{i=1}^{\infty} \frac{1}{2^i} \cdot \frac{1}{2^{i-k}} = 2^k \sum_{i=1}^{\infty} 2^{-2i} = \frac{2^k}{3}.$$

We combine both cases and obtain

$$P\{X - Y = k\} = \frac{2^{-|k|}}{3}, \quad k \in \mathbb{Z}.$$

Which random experiment does $X - Y$ describe? Suppose players A and B both toss
a fair coin. Let X be the number of trials A needs to observe the first “head.”
Similarly, Y describes how often B has to toss his coin to get the first “head.” Thus, the
value of $X - Y$ tells us how many trials later (or earlier) player A got his first “head”
than B got his.

For example, if B got his first “head” one trial earlier than A, then $X - Y = 1$. The
probability that this occurs equals 1/6.
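
The closed formula of this example can be compared with a direct numerical evaluation of the convolution sum. The following Python sketch truncates the infinite series after 200 terms, which is an assumption that is harmless here because the terms decay geometrically; the function names are hypothetical.

```python
def pmf(j):
    """P{X = j} = P{Y = j} = 2^(-j) for j = 1, 2, ... and 0 otherwise."""
    return 0.5 ** j if j >= 1 else 0.0

def pmf_difference(k, terms=200):
    """Sum over i of P{X = i} * P{Y = i - k}, truncated after `terms` summands."""
    return sum(pmf(i) * pmf(i - k) for i in range(1, terms + 1))

largest_gap = max(
    abs(pmf_difference(k) - 2.0 ** (-abs(k)) / 3.0) for k in range(-6, 7)
)
prob_one_trial_later = pmf_difference(1)   # the case X - Y = 1
```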


One special case of Proposition 4.5.4 is of particular interest. 


Proposition 4.5.6 (Convolution formula for $\mathbb{N}_0$-valued random variables). Let X and Y
be two independent random variables with values in $\mathbb{N}_0$. If $k \in \mathbb{N}_0$, then it follows that

$$P\{X + Y = k\} = \sum_{i=0}^{k} P\{X = i\} \cdot P\{Y = k - i\}.$$

Proof: Regard X and Y as $\mathbb{Z}$-valued random variables with P{X = i} = P{Y = i} = 0 for
all $i = -1, -2, \ldots$. If $k \in \mathbb{N}_0$, then Proposition 4.5.4 lets us conclude that

$$P\{X + Y = k\} = \sum_{i=-\infty}^{\infty} P\{X = i\} \cdot P\{Y = k - i\} = \sum_{i=0}^{k} P\{X = i\} \cdot P\{Y = k - i\}.$$

Here we used P{X = i} = 0 for i < 0 and $P\{Y = k - i\} = 0$ if i > k. For k < 0 it follows that
P{X + Y = k} = 0 because in this case $P\{Y = k - i\} = 0$ for all $i \ge 0$. This completes the
proof.
□


Example 4.5.7. Let X and Y be two independent random variables, both uniformly
distributed on {1, 2, ..., N}. Which probability distribution does X + Y possess?

Answer: Of course, X + Y attains only values in {2, 3, ..., 2N}. Hence, P{X + Y = k}
is only of interest for $2 \le k \le 2N$. Here we get

$$P\{X + Y = k\} = \frac{\#(I_k)}{N^2}, \tag{4.18}$$

where $I_k$ is defined by

$$I_k := \{i \in \{1, \ldots, N\} : 1 \le k - i \le N\} = \{i \in \{1, \ldots, N\} : k - N \le i \le k - 1\}.$$

To verify eq. (4.18) use that for $i \notin I_k$ either P{X = i} = 0 or $P\{Y = k - i\} = 0$. It is not
difficult to prove that

$$\#(I_k) = \begin{cases} k - 1 & : 2 \le k \le N + 1 \\ 2N - k + 1 & : N + 1 < k \le 2N, \end{cases}$$

which leads to

$$P\{X + Y = k\} = \begin{cases} \frac{k-1}{N^2} & : 2 \le k \le N + 1 \\ \frac{2N-k+1}{N^2} & : N + 1 < k \le 2N \\ 0 & : \text{otherwise.} \end{cases}$$

If N = 6, then X + Y may be viewed as the sum of two rolls of a die. Here the above
formula leads to the values of P{X + Y = k}, k = 2, ..., 12, which we, by a direct approach,
already computed in Example 3.2.15.
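
For N = 6 the convolution formula can be evaluated mechanically. The Python sketch below (with the hypothetical helper `convolve`) represents a distribution on the non-negative integers as a list indexed by the value and uses exact fractions.

```python
from fractions import Fraction

def convolve(p, q):
    """Distribution of X + Y for independent X, Y given as lists p[i], q[j]."""
    r = [Fraction(0)] * (len(p) + len(q) - 1)
    for i, p_i in enumerate(p):
        for j, q_j in enumerate(q):
            r[i + j] += p_i * q_j
    return r

N = 6
die = [Fraction(0)] + [Fraction(1, N)] * N    # die[k] = P{X = k}, k = 0, ..., 6
two_dice = convolve(die, die)                 # two_dice[k] = P{X + Y = k}
```

The entries of `two_dice` agree with the piecewise formula of the example, for instance `two_dice[7]` equals 6/36.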


Finally, let us briefly discuss the case of two arbitrary independent discrete random
variables. Assume that X and Y have values in at most countably infinite sets D and E,
respectively. Then X + Y maps into

$$D + E := \{x + y : x \in D,\; y \in E\}.$$

Note that D + E is also at most countably infinite.
Under these assumptions the following is valid.

Proposition 4.5.8. Suppose X and Y are two independent discrete random variables
with values in the (at most) countably infinite sets D and E, respectively. For $z \in D + E$ it
follows that

$$P\{X + Y = z\} = \sum_{\{(x,y) \in D \times E \,:\, x + y = z\}} P\{X = x\} \cdot P\{Y = y\}.$$

Proof: For fixed $z \in D + E$ define $B_z \subseteq D \times E$ by $B_z := \{(x, y) : x + y = z\}$. Using this
notation we get

$$P\{X + Y = z\} = P\{(X, Y) \in B_z\} = P_{(X,Y)}(B_z),$$

where again $P_{(X,Y)}$ denotes the joint distribution of X and Y. Now we may proceed as
in the proof of Proposition 4.5.4. The independence of X and Y implies

$$P_{(X,Y)}(B_z) = \sum_{\{(x,y) \in D \times E \,:\, x + y = z\}} P\{X = x\} \cdot P\{Y = y\},$$

proving the proposition.
□

Remark 4.5.9. If $D = E = \mathbb{Z}$, then Proposition 4.5.8 implies Proposition 4.5.4, while for
$D = E = \mathbb{N}_0$ we rediscover Proposition 4.5.6.


4.5.2 Sums of Continuous Random Variables 


In this section we investigate the following question: let X and Y be two continuous 
random variables with density functions p and q. Is X + Y continuous as well, and if 
this is so, how do we compute its density? 

To answer this question we need a special type of composing two functions. 


Definition 4.5.10. Let f and g be two Riemann integrable functions from $\mathbb{R}$ to $\mathbb{R}$.
Their convolution $f \star g$ is defined by

$$(f \star g)(x) := \int_{-\infty}^{\infty} f(x - y)\, g(y)\,dy, \quad x \in \mathbb{R}. \tag{4.19}$$

Remark 4.5.11. The convolution is a commutative operation, that is,

$$f \star g = g \star f.$$

This follows by the change of variables $u = x - y$ in eq. (4.19), thus

$$(f \star g)(x) = \int_{-\infty}^{\infty} f(x - y)\, g(y)\,dy = \int_{-\infty}^{\infty} f(u)\, g(x - u)\,du = (g \star f)(x), \quad x \in \mathbb{R}.$$

Remark 4.5.12. For general functions f and g the integral in eq. (4.19) does not always
exist for all $x \in \mathbb{R}$. The investigation of this question requires facts and notations⁸
from Measure Theory; therefore, we will not treat it here. We only state a special case,
which suffices for our later purposes. Moreover, for concrete functions f and g it is
mostly easy to check for which $x \in \mathbb{R}$ the value $(f \star g)(x)$ exists.


Proposition 4.5.13. Let p and q be two probability densities and suppose that at least
one of them is bounded. Then $(p \star q)(x)$ exists for all $x \in \mathbb{R}$.

Proof: Say p is bounded, that is, there is a constant c > 0 such that $0 \le p(z) \le c$ for all
$z \in \mathbb{R}$. Since $q(y) \ge 0$, if $x \in \mathbb{R}$, then

$$0 \le \int_{-\infty}^{\infty} p(x - y)\, q(y)\,dy \le c \int_{-\infty}^{\infty} q(y)\,dy = c < \infty.$$

This proves that $(p \star q)(x)$ exists for all $x \in \mathbb{R}$.
Since $p \star q = q \star p$, the same argument applies if q is bounded.
□

8 For example, exists “almost everywhere.”


The next result provides us with a formula for the evaluation of the density function 
of X + Y for independent continuous X and Y. 


Proposition 4.5.14 (Convolution formula for continuous random variables). Let X and
Y be two independent random variables with distribution densities p and q. Then X + Y
is continuous as well, and its density r may be computed by

$$r(x) = (p \star q)(x) = \int_{-\infty}^{\infty} p(y)\, q(x - y)\,dy.$$

Proof: We have to show that $r = p \star q$ satisfies

$$P\{X + Y \le t\} = \int_{-\infty}^{t} r(x)\,dx, \quad t \in \mathbb{R}. \tag{4.20}$$

Fix $t \in \mathbb{R}$ for a moment and define $B_t \subseteq \mathbb{R}^2$ by

$$B_t := \{(u, y) \in \mathbb{R}^2 : u + y \le t\}.$$

Then we get

$$P\{X + Y \le t\} = P\{(X, Y) \in B_t\} = P_{(X,Y)}(B_t). \tag{4.21}$$

To compute the right-hand side of eq. (4.21) we use Proposition 3.6.18. It asserts that
the joint distribution $P_{(X,Y)}$ of independent X and Y has the density $(u, y) \mapsto p(u)\,q(y)$, that
is, if $B \subseteq \mathbb{R}^2$, then

$$P_{(X,Y)}(B) = \iint_{B} p(u)\, q(y)\,dy\,du.$$

Choosing $B = B_t$ in the last formula, eq. (4.21) may now be written as

$$P\{X + Y \le t\} = \iint_{B_t} p(u)\, q(y)\,dy\,du = \int_{-\infty}^{\infty} \Big[\int_{-\infty}^{t-y} p(u)\,du\Big] q(y)\,dy. \tag{4.22}$$

Next we change the variables in the inner integral as follows⁹: $u = x - y$, hence du = dx.
Then the right-hand integrals in eq. (4.22) coincide with

$$\int_{-\infty}^{\infty} \Big[\int_{-\infty}^{t} p(x - y)\,dx\Big] q(y)\,dy = \int_{-\infty}^{t} \Big[\int_{-\infty}^{\infty} p(x - y)\, q(y)\,dy\Big] dx = \int_{-\infty}^{t} (p \star q)(x)\,dx.$$

Hereby we used that p and q are non-negative, so that we may interchange the
integrals by virtue of Proposition A.5.5. Thus, eq. (4.20) is satisfied, which completes the
proof.
□


4.6 Sums of Certain Random Variables 


Let us start with the investigation of the sum of independent binomial distributed 
random variables. Here the following is valid. 


Proposition 4.6.1. Let X and Y be two independent random variables, accordingly
$B_{n,p}$ and $B_{m,p}$ distributed for some $n, m \ge 1$, and some $p \in [0, 1]$. Then X + Y is
$B_{n+m,p}$-distributed.

Proof: By Proposition 4.5.6 we get that for $0 \le k \le m + n$

$$P\{X + Y = k\} = \sum_{j=0}^{k} \Big[\binom{n}{j} p^j (1-p)^{n-j}\Big] \Big[\binom{m}{k-j} p^{k-j} (1-p)^{m-(k-j)}\Big] = p^k (1-p)^{n+m-k} \sum_{j=0}^{k} \binom{n}{j} \binom{m}{k-j}.$$

To evaluate the sum we apply Vandermonde’s identity (Proposition A.3.8), which
asserts

$$\sum_{j=0}^{k} \binom{n}{j} \binom{m}{k-j} = \binom{n+m}{k}.$$

This leads to

$$P\{X + Y = k\} = \binom{n+m}{k} p^k (1-p)^{m+n-k},$$

and X + Y is $B_{n+m,p}$-distributed.
□


9 Note that in the inner integral y is a constant.

Interpretation: In a first experiment we toss a biased coin n times and in a second one
m times. We combine these two experiments into one and toss the coin now n + m times.
Then we observe exactly k times “head” during the n + m trials if there is some $j \le k$
so that we had j “heads” among the first n trials and $k - j$ among the second m ones.
Finally, we have to sum the probabilities of all these events over $j \le k$.
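
The statement about sums of binomial random variables is easy to check numerically. The Python sketch below uses hypothetical helper names and arbitrarily chosen parameters n = 4, m = 7, and p = 0.3; it convolves the two binomial distributions and compares the result with the binomial distribution for n + m trials.

```python
from math import comb

def binomial_pmf(n, p):
    """List of the probabilities of k successes, k = 0, ..., n."""
    return [comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]

def convolve(a, b):
    """Convolution formula for independent variables with values 0, 1, 2, ..."""
    r = [0.0] * (len(a) + len(b) - 1)
    for i, a_i in enumerate(a):
        for j, b_j in enumerate(b):
            r[i + j] += a_i * b_j
    return r

n, m, p = 4, 7, 0.3                       # arbitrary illustration values
lhs = convolve(binomial_pmf(n, p), binomial_pmf(m, p))
rhs = binomial_pmf(n + m, p)
max_gap = max(abs(x - y) for x, y in zip(lhs, rhs))   # only rounding error
```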


Corollary 4.6.2. Let $X_1, \ldots, X_n$ be independent $B_{1,p}$-distributed, that is,

$$P\{X_j = 0\} = 1 - p \quad \text{and} \quad P\{X_j = 1\} = p, \quad j = 1, \ldots, n.$$

Then their sum $X_1 + \cdots + X_n$ is $B_{n,p}$-distributed.

Proof: Apply Proposition 4.6.1 successively, first to $X_1$ and $X_2$, then to $X_1 + X_2$ and $X_3$,
and so on.
□

Remark 4.6.3. Observe that

$$X_1 + \cdots + X_n = \#\{j \le n : X_j = 1\}.$$

Corollary 4.6.2 justifies the interpretation of the binomial distribution given in
Section 1.4.3. Indeed, the event $\{X_j = 1\}$ occurs if in trial j we observe success. Thus,
$X_1 + \cdots + X_n$ equals the number of successes in n independent trials. Hereby, the success
probability is $P\{X_j = 1\} = p$.

In the literature the following notation is common.

Definition 4.6.4. A sequence $X_1, X_2, \ldots$ of independent $B_{1,p}$-distributed random
variables is called a Bernoulli trial or Bernoulli process with success probability
$p \in [0, 1]$.

With these notations, Corollary 4.6.2 may now be formulated as follows:
Let $X_1, X_2, \ldots$ be a Bernoulli trial with success probability p. Then for $n \ge 1$,

$$P\{X_1 + \cdots + X_n = k\} = \binom{n}{k} p^k (1-p)^{n-k}, \quad k = 0, \ldots, n.$$


Let X and Y be two independent Poisson distributed random variables. Which
distribution does X + Y possess? The next result answers this question.

Proposition 4.6.5. Let X and Y be independent $\mathrm{Pois}_\lambda$- and $\mathrm{Pois}_\mu$-distributed for some
$\lambda > 0$ and $\mu > 0$, respectively. Then X + Y is $\mathrm{Pois}_{\lambda+\mu}$-distributed.

Proof: Proposition 4.5.6 and the binomial theorem (cf. Proposition A.3.7) imply

$$P\{X + Y = k\} = \sum_{j=0}^{k} \Big[\frac{\lambda^j}{j!} e^{-\lambda}\Big] \Big[\frac{\mu^{k-j}}{(k-j)!} e^{-\mu}\Big] = \frac{e^{-(\lambda+\mu)}}{k!} \sum_{j=0}^{k} \frac{k!}{j!\,(k-j)!} \lambda^j \mu^{k-j} = \frac{e^{-(\lambda+\mu)}}{k!} \sum_{j=0}^{k} \binom{k}{j} \lambda^j \mu^{k-j} = \frac{(\lambda+\mu)^k}{k!} e^{-(\lambda+\mu)}.$$

Consequently, as asserted, X + Y is $\mathrm{Pois}_{\lambda+\mu}$-distributed.
□

Interpretation: The numbers of phone calls arriving per day at two call centers A and
B are Poisson distributed with parameters¹⁰ $\lambda$ and $\mu$. Suppose that these two centers
have different customers, that is, we assume that the numbers of calls in A and in B are
independent of each other. Proposition 4.6.5 asserts that the number of calls arriving
per day either in A or in B is again Poisson distributed, yet now with parameter $\lambda + \mu$.
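
As a numerical cross-check of the proposition, one may evaluate the convolution sum for concrete parameters. The Python sketch below uses hypothetical function names and the arbitrary values 2.5 and 4.0 for the two parameters; the probabilities are computed via logarithms to avoid large factorials.

```python
import math

def poisson_pmf(k, lam):
    """lam^k / k! * exp(-lam), computed via logarithms."""
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))

def pmf_of_sum(k, lam, mu):
    """Convolution formula for two independent Poisson variables."""
    return sum(poisson_pmf(j, lam) * poisson_pmf(k - j, mu) for j in range(k + 1))

lam, mu = 2.5, 4.0                        # arbitrary illustration values
max_gap = max(
    abs(pmf_of_sum(k, lam, mu) - poisson_pmf(k, lam + mu)) for k in range(40)
)                                         # only rounding error remains
```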


Example 4.6.6. This example deals with the distribution of raisins in a set of
dough. More precisely, suppose we have N pounds of dough and therein are n raisins
uniformly distributed. Choose at random a one-pound piece of dough. Find the
probability that there are $k \ge 0$ raisins in the chosen piece.

Approach 1: Since the raisins are uniformly distributed in the dough, the probability
that a single raisin is in the chosen piece equals 1/N. Hence, if X is the number
of raisins in that piece, it is $B_{n,p}$-distributed with p = 1/N. Assuming that N is big, the
random variable X is approximately $\mathrm{Pois}_\lambda$-distributed with $\lambda = n/N$, that is,

$$P\{X = k\} \approx \frac{\lambda^k}{k!} e^{-\lambda}, \quad k = 0, 1, \ldots.$$

Note that $\lambda = n/N$ coincides with the average number of raisins per pound of dough.

Approach 2: Assume that we took in the previous model $N \to \infty$, that is, we have
an “infinite” amount of dough and “infinitely” many raisins. Which distribution does
X, the number of raisins in a one-pound piece, now possess?

First we have to determine what it means that the amount of dough is “infinite”
and that the raisins are uniformly distributed¹¹ therein. This is expressed by the
following conditions:

(a) The mass of dough is unbelievably huge, hence, whenever we choose two
different pieces, the numbers of raisins in each of these pieces are independent of
each other.


10 Later on, in Proposition 5.1.16, we will see that $\lambda$ and $\mu$ are the mean values of arriving calls per day.
11 Note that the multivariate uniform distribution only makes sense (cf. Definition 1.8.10) if the
underlying set has a finite volume.

(b) The fact that the raisins are uniformly distributed is expressed by the following
condition: suppose the number of raisins in a one-pound piece is $n \ge 0$. If this
piece is split into two pieces, say $K_1$ and $K_2$ of weight $\alpha$ and $1 - \alpha$ pounds, then
the probability that a single raisin (of the n) is in $K_1$ equals $\alpha$, and that it is in $K_2$
is $1 - \alpha$.

Fix $0 < \alpha < 1$ and choose in a first step a piece $K_1$ of $\alpha$ pounds and in a second one
another piece $K_2$ of weight $1 - \alpha$. Let $X_1$ and $X_2$ be the number of raisins in each of
the two pieces. By condition (a), the random variables $X_1$ and $X_2$ are independent. If
$X := X_1 + X_2$, then X is the number of raisins in a randomly chosen one-pound piece.
Suppose now X = n, that is, there are n raisins in the one-pound piece. Then by
condition (b), the probability for k raisins in $K_1$ is described by the binomial distribution
$B_{n,\alpha}$. Recall that the success probability for a single raisin is $\alpha$, thus, $X_1 = k$ means that we
have k successes. This may be formulated as follows: for $0 \le k \le n$,

$$P\{X_1 = k \mid X = n\} = B_{n,\alpha}(\{k\}) = \binom{n}{k} \alpha^k (1 - \alpha)^{n-k}. \tag{4.23}$$

Rewriting eq. (4.23) leads to

$$P\{X_1 = k, X_2 = n - k\} = P\{X_1 = k, X = n\} = P\{X_1 = k \mid X = n\} \cdot P\{X = n\} = P\{X = n\} \cdot \binom{n}{k} \alpha^k (1 - \alpha)^{n-k}. \tag{4.24}$$

Observe that in contrast to eq. (4.23), eq. (4.24) remains valid if P{X = n} = 0. Indeed,
if P{X = n} = 0, by Proposition 4.5.6, the event $\{X_1 = k, X_2 = n - k\}$ has probability zero
as well.

The independence of $X_1$ and $X_2$ and eq. (4.24) imply that, if n = 0, 1, ... and
k = 0, ..., n, then

$$P\{X_1 = k\} \cdot P\{X_2 = n - k\} = P\{X = n\} \cdot \binom{n}{k} \alpha^k (1 - \alpha)^{n-k}.$$

Setting k = n, we get

$$P\{X_1 = n\} \cdot P\{X_2 = 0\} = P\{X = n\} \cdot \alpha^n, \tag{4.25}$$

while for $n \ge 1$ and $k = n - 1$ we obtain

$$P\{X_1 = n - 1\} \cdot P\{X_2 = 1\} = P\{X = n\} \cdot n \cdot \alpha^{n-1} (1 - \alpha). \tag{4.26}$$
In particular, from eq. (4.25) follows $P\{X_2 = 0\} > 0$. If this probability were zero,
then this would imply P{X = n} = 0 for all $n \in \mathbb{N}_0$, which is impossible in view of
$P\{X \in \mathbb{N}_0\} = 1$.

In a next step we solve eqs. (4.25) and (4.26) with respect to P{X = n} and make
them equal. Doing so, for $n \ge 1$ we get

$$P\{X_1 = n\} = \frac{\alpha}{n} (1 - \alpha)^{-1} \frac{P\{X_2 = 1\}}{P\{X_2 = 0\}} \cdot P\{X_1 = n - 1\} = \frac{\alpha\lambda}{n} \cdot P\{X_1 = n - 1\}, \tag{4.27}$$

where $\lambda \ge 0$ is defined by

$$\lambda := (1 - \alpha)^{-1} \frac{P\{X_2 = 1\}}{P\{X_2 = 0\}}. \tag{4.28}$$

Do we have $\lambda > 0$? If $\lambda = 0$, then $P\{X_2 = 1\} = 0$ and by eq. (4.26) follows P{X = n} = 0 for
$n \ge 1$. Consequently, P{X = 0} = 1, which says that there are no raisins in the dough.
We exclude this trivial case, thus it follows that $\lambda > 0$.

Finally, a successive application of eq. (4.27) implies for $n \in \mathbb{N}_0$¹²

$$P\{X_1 = n\} = \frac{(\alpha\lambda)^n}{n!} \cdot P\{X_1 = 0\}, \tag{4.29}$$

leading to

$$1 = \sum_{n=0}^{\infty} P\{X_1 = n\} = P\{X_1 = 0\} \cdot \sum_{n=0}^{\infty} \frac{(\alpha\lambda)^n}{n!} = P\{X_1 = 0\}\, e^{\alpha\lambda},$$

that is, we have $P\{X_1 = 0\} = e^{-\alpha\lambda}$. Plugging this into eq. (4.29) gives

$$P\{X_1 = n\} = \frac{(\alpha\lambda)^n}{n!} e^{-\alpha\lambda},$$

and $X_1$ is Poisson distributed with parameter $\alpha\lambda$.

Let us now interchange the roles of $X_1$ and $X_2$, hence also of $\alpha$ and $1 - \alpha$. An application
of the first step to $X_2$ tells us that it is Poisson distributed, but now with parameter
$(1 - \alpha)\lambda'$, where in view of eq. (4.28) $\lambda'$ is given by¹³

$$\lambda' = \alpha^{-1} \frac{P\{X_1 = 1\}}{P\{X_1 = 0\}} = \alpha^{-1} \frac{\alpha\lambda\, e^{-\alpha\lambda}}{e^{-\alpha\lambda}} = \lambda.$$

Thus, $X_2$ is $\mathrm{Pois}_{(1-\alpha)\lambda}$-distributed.


12 If n = 0, the equation holds trivially.
13 Observe that we have to replace $X_2$ by $X_1$ and $1 - \alpha$ by $\alpha$.

Since $X_1$ and $X_2$ are independent, Proposition 4.6.5 applies, hence $X = X_1 + X_2$ is
$\mathrm{Pois}_\lambda$-distributed or, equivalently,

$$P\{\text{There are } k \text{ raisins in a one-pound piece}\} = \frac{\lambda^k}{k!} e^{-\lambda}.$$


Remark 4.6.7. Which role does the parameter $\lambda > 0$ play in this model? As already
mentioned, Proposition 5.1.16 will tell us that $\lambda$ is the average number of raisins per
pound of dough. Thus, if $\rho > 0$ and we ask for the number of raisins in a piece of $\rho$
pounds, then this number is $\mathrm{Pois}_{\rho\lambda}$-distributed¹⁴, that is,

$$P\{k \text{ raisins in } \rho \text{ pounds dough}\} = \frac{(\rho\lambda)^k}{k!} e^{-\rho\lambda}.$$

Assume a dough contains on average 20 raisins per pound. Let X be the number of
raisins in a bread baked of five pounds dough. Then X is $\mathrm{Pois}_{100}$-distributed and

$$\begin{aligned}
P(\{95 \le X \le 105\}) &= 0.4176, & P(\{90 \le X \le 110\}) &= 0.7065, \\
P(\{85 \le X \le 115\}) &= 0.8793, & P(\{80 \le X \le 120\}) &= 0.9599, \\
P(\{75 \le X \le 125\}) &= 0.9892, & P(\{70 \le X \le 130\}) &= 0.9976.
\end{aligned}$$
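
Probabilities of this kind are finite sums of Poisson probabilities and can be recomputed with a few lines of Python. In the sketch below the function names are hypothetical, and the intervals are read as symmetric around 100 with both end points included, which is an assumption about the inequalities.

```python
import math

def poisson_pmf(k, lam):
    """lam^k / k! * exp(-lam), computed via logarithms."""
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))

def poisson_interval(a, b, lam):
    """Probability that a Poisson variable with parameter lam lies in {a, ..., b}."""
    return sum(poisson_pmf(k, lam) for k in range(a, b + 1))

lam = 5 * 20     # five pounds of dough, 20 raisins per pound on average
probabilities = {
    d: poisson_interval(100 - d, 100 + d, lam) for d in (5, 10, 15, 20, 25, 30)
}
```

The dictionary `probabilities` is indexed by the half-width d of the interval from 100 - d to 100 + d.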


Additional question: Suppose we buy two loaves of bread baked from $\rho$ pounds dough
each. What is the probability that one of these two loaves contains at least twice as
many raisins as the other one?

Answer: Let X be the number of raisins in the first loaf, and Y the number of
raisins in the second one. By assumption, X and Y are independent, and both are
$\mathrm{Pois}_{\rho\lambda}$-distributed, where as before $\lambda > 0$ is the average number of raisins per pound
of dough. The probability we are interested in is

$$P\{X \ge 2Y \text{ or } Y \ge 2X\} = P\{X \ge 2Y\} + P\{Y \ge 2X\} = 2P\{X \ge 2Y\}.$$

It follows

$$2P(X \ge 2Y) = 2 \sum_{k=0}^{\infty} P(Y = k, X \ge 2k) = 2 \sum_{k=0}^{\infty} P(Y = k) \cdot P(X \ge 2k) = 2 \sum_{k=0}^{\infty} P(Y = k) \sum_{j=2k}^{\infty} P(X = j) = 2 e^{-2\rho\lambda} \sum_{k=0}^{\infty} \frac{(\rho\lambda)^k}{k!} \sum_{j=2k}^{\infty} \frac{(\rho\lambda)^j}{j!}.$$

14 Because on average there are $\rho\lambda$ raisins in a piece of $\rho$ pounds.

If the average number of raisins per pound is $\lambda = 20$, and if the loaves are baked from
$\rho = 5$ pounds dough, then this probability is approximately

$$P(X \ge 2Y \text{ or } Y \ge 2X) = 3.17061 \times 10^{-6}.$$

If $\rho = 1$, that is, the loaves are made from one-pound dough each, then

$$P(X \ge 2Y \text{ or } Y \ge 2X) = 0.0430079.$$
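
The double series for this probability can be evaluated numerically. The following Python sketch is one possible way to do it, under assumptions that should be stated clearly: the function names are hypothetical, the event is read as “at least twice as many” (with both loaves independent and Poisson distributed with the same mean), and the infinite sums are cut off where the Poisson probabilities are negligible.

```python
import math

def poisson_pmf(k, lam):
    """lam^k / k! * exp(-lam), computed via logarithms."""
    return math.exp(k * math.log(lam) - lam - math.lgamma(k + 1))

def prob_one_loaf_twice_the_other(mean):
    """Evaluate 2 * sum_k P{Y = k} * P{X >= 2k} for Poisson X, Y with this mean."""
    cutoff = int(mean + 12 * math.sqrt(mean)) + 30     # terms beyond are negligible
    pmf = [poisson_pmf(k, mean) for k in range(2 * cutoff + 1)]
    tail = [0.0] * (2 * cutoff + 2)                    # tail[j] approximates P{X >= j}
    for j in range(2 * cutoff, -1, -1):
        tail[j] = tail[j + 1] + pmf[j]
    return 2.0 * sum(pmf[k] * tail[2 * k] for k in range(cutoff + 1))

p_one_pound = prob_one_loaf_twice_the_other(1 * 20.0)    # loaves of one pound
p_five_pounds = prob_one_loaf_twice_the_other(5 * 20.0)  # loaves of five pounds
```

Precomputing the tail probabilities keeps the work linear in the cutoff instead of quadratic.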


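Now we investigate the distribution of the sum of two independent negative binomial
distributed random variables. Recall that X is $B^{-}_{n,p}$-distributed if

$$P\{X = k\} = \binom{k-1}{k-n} p^n (1-p)^{k-n}, \quad k = n, n + 1, \ldots.$$

Proposition 4.6.8. Let X and Y be independent accordingly $B^{-}_{n,p}$ and $B^{-}_{m,p}$ distributed
for some $n, m \ge 1$. Then X + Y is $B^{-}_{n+m,p}$-distributed.

Proof: We derive from Example 4.1.8 that, if $k \in \mathbb{N}_0$, then

$$P\{X - n = k\} = \binom{-n}{k} p^n (p - 1)^k \quad \text{and} \quad P\{Y - m = k\} = \binom{-m}{k} p^m (p - 1)^k.$$

An application of Proposition 4.5.6 to $X - n$ and $Y - m$ implies

$$P\{X + Y - (n + m) = k\} = \sum_{j=0}^{k} \Big[\binom{-n}{j} p^n (p-1)^j\Big] \Big[\binom{-m}{k-j} p^m (p-1)^{k-j}\Big] = p^{n+m} (p-1)^k \sum_{j=0}^{k} \binom{-n}{j} \binom{-m}{k-j}.$$

To compute the last sum we use Proposition A.5.3, which asserts that

$$\sum_{j=0}^{k} \binom{-n}{j} \binom{-m}{k-j} = \binom{-n-m}{k}.$$

Consequently, for each $k \in \mathbb{N}_0$,

$$P\{X + Y - (n + m) = k\} = \binom{-n-m}{k} p^{n+m} (p-1)^k.$$

Another application of eq. (4.2) (this time with n + m) leads to

$$P\{X + Y = k\} = \binom{k-1}{k-(n+m)} p^{n+m} (1-p)^{k-(n+m)}, \quad k = n + m, n + m + 1, \ldots.$$

This completes the proof.
□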




Corollary 4.6.9. Let $X_1, \ldots, X_n$ be independent $G_p$-distributed (geometric distributed)
random variables. Then their sum $X_1 + \cdots + X_n$ is $B^{-}_{n,p}$-distributed.

Proof: Use $G_p = B^{-}_{1,p}$ and apply Proposition 4.6.8 n times.
□

Interpretation: The following two experiments are completely equivalent: one is to
play the same game until one observes success for the nth time. The other
experiment is, after each success to start a new game, until one observes success in
the nth (and last) game. Here we assume that all n games are executed independently
and possess the same success probability.
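
The equivalence of the two experiments can be checked numerically by convolving geometric distributions. The Python sketch below is an illustration under assumptions: the helper names and the values p = 0.4 and n = 3 are chosen freely, the geometric distribution is taken with values 1, 2, ... (number of trials up to the first success), and all lists are truncated at index 40, which does not affect the entries that are compared.

```python
from math import comb

def geometric_pmf(p, kmax):
    """P{X = k} = p (1 - p)^(k - 1) for k = 1, ..., kmax; index 0 carries 0."""
    return [0.0] + [p * (1 - p) ** (k - 1) for k in range(1, kmax + 1)]

def negative_binomial_pmf(n, p, kmax):
    """Probability that the n-th success occurs exactly in trial k."""
    return [
        comb(k - 1, k - n) * p**n * (1 - p) ** (k - n) if k >= n else 0.0
        for k in range(kmax + 1)
    ]

def convolve_truncated(a, b):
    """Convolution formula, keeping only the indices 0, ..., len(a) - 1."""
    return [sum(a[i] * b[k - i] for i in range(k + 1)) for k in range(len(a))]

p, n, kmax = 0.4, 3, 40
g = geometric_pmf(p, kmax)
total = g
for _ in range(n - 1):                    # distribution of X_1 + ... + X_n
    total = convolve_truncated(total, g)
target = negative_binomial_pmf(n, p, kmax)
max_gap = max(abs(a - b) for a, b in zip(total, target))
```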

Let U and V be two independent random variables, both uniformly distributed
on [0, 1]. Which distribution density does U + V possess?


Proposition 4.6.10. The sum of two independent random variables U and V, uniformly
distributed on [0, 1], has the density r defined by

$$r(x) = \begin{cases} x & : 0 \le x < 1 \\ 2 - x & : 1 \le x \le 2 \\ 0 & : \text{otherwise.} \end{cases} \tag{4.30}$$

Proof: The distribution densities p and q of U and V are given by p(x) = q(x) = 1 if
$0 \le x \le 1$ and p(x) = q(x) = 0 otherwise. Proposition 4.5.14 asserts that U + V has
density $r = p \star q$ computed by

$$r(x) = \int_{-\infty}^{\infty} p(x - y)\, q(y)\,dy = \int_{0}^{1} p(x - y)\,dy.$$

But $p(x - y) = 1$ if and only if $0 \le x - y \le 1$ or, equivalently, if and only if $x - 1 \le y \le x$.
Taking into account the restriction $0 \le y \le 1$, it follows $p(x - y)\,q(y) = 1$ if and only if
$y \in [\max\{x - 1, 0\}, \min\{x, 1\}]$. In particular, r(x) = 0 for $x \notin [0, 2]$, and if $0 \le x \le 2$, then

$$r(x) = \min\{x, 1\} - \max\{x - 1, 0\}.$$

It is not difficult to see that r may also be written as in eq. (4.30). This completes the
proof.
□


Application: Suppose we choose independently and according to the uniform distribution
two numbers $u_1$ and $u_2$ in [0, 1]. Then the probability that $a \le u_1 + u_2 \le b$ equals
$\int_a^b r(x)\,dx$ with r given by eq. (4.30). For example,

$$P\Big\{\tfrac{1}{2} \le u_1 + u_2 \le \tfrac{3}{2}\Big\} = \int_{1/2}^{1} x\,dx + \int_{1}^{3/2} (2 - x)\,dx = \frac{3}{8} + \frac{3}{8} = \frac{3}{4}.$$
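
The value 3/4 can be reproduced numerically. The sketch below integrates the density from eq. (4.30) in closed form and compares the result with a simple simulation; the function names, the seed and the number of trials are arbitrary choices made for this illustration.

```python
import random

def cdf(t):
    """Distribution function of U + V, obtained by integrating r from eq. (4.30)."""
    if t <= 0:
        return 0.0
    if t < 1:
        return t * t / 2
    if t <= 2:
        return 1 - (2 - t) ** 2 / 2
    return 1.0

def prob_between(a, b):
    """P{a <= U + V <= b} as the integral of r over [a, b]."""
    return cdf(b) - cdf(a)

def simulate(a, b, trials=200_000, seed=1):
    """Monte Carlo estimate of the same probability."""
    rng = random.Random(seed)
    hits = sum(a <= rng.random() + rng.random() <= b for _ in range(trials))
    return hits / trials

print(prob_between(0.5, 1.5))  # 0.75
print(simulate(0.5, 1.5))      # close to 0.75, up to sampling error
```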
We now investigate the sum of two gamma distributed random variables. Recall that
the density of a $\Gamma_{\alpha,\beta}$-distributed random variable is given by

$$p_{\alpha,\beta}(x) = \frac{1}{\alpha^{\beta}\,\Gamma(\beta)}\, x^{\beta-1} e^{-x/\alpha}$$

if $x > 0$, while $p_{\alpha,\beta}(x) = 0$ otherwise.


Proposition 4.6.11. Let $X_1$ and $X_2$ be two independent random variables distributed
according to $\Gamma_{\alpha,\beta_1}$ and $\Gamma_{\alpha,\beta_2}$, respectively. Then $X_1 + X_2$ is $\Gamma_{\alpha,\beta_1+\beta_2}$-distributed.


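Proof: If r denotes the density of $X_1 + X_2$, Proposition 4.5.14 implies

$$r(x) = (p_{\alpha,\beta_1} * p_{\alpha,\beta_2})(x) = \int_{-\infty}^{\infty} p_{\alpha,\beta_1}(x-y)\,p_{\alpha,\beta_2}(y)\,dy, \qquad x \in \mathbb{R}, \tag{4.31}$$

and we have to show that $r = p_{\alpha,\beta_1+\beta_2}$.
It is easy to see that $r(x) = 0$ if $x \le 0$, hence it suffices to evaluate eq. (4.31) for
$x > 0$. Since $p_{\alpha,\beta_1}(x-y) = 0$ if $y \ge x$ and $p_{\alpha,\beta_2}(y) = 0$ if $y \le 0$,

$$r(x) = \frac{1}{\alpha^{\beta_1+\beta_2}\,\Gamma(\beta_1)\Gamma(\beta_2)} \int_0^x (x-y)^{\beta_1-1}\, y^{\beta_2-1}\, e^{-(x-y)/\alpha}\, e^{-y/\alpha}\,dy$$

$$\phantom{r(x)} = \frac{x^{\beta_1+\beta_2-2}\, e^{-x/\alpha}}{\alpha^{\beta_1+\beta_2}\,\Gamma(\beta_1)\Gamma(\beta_2)} \int_0^x \Big(1 - \frac{y}{x}\Big)^{\beta_1-1} \Big(\frac{y}{x}\Big)^{\beta_2-1} dy.$$

Changing the variable as $u := y/x$, hence $dy = x\,du$, we obtain

$$r(x) = \frac{x^{\beta_1+\beta_2-1}\, e^{-x/\alpha}}{\alpha^{\beta_1+\beta_2}\,\Gamma(\beta_1)\Gamma(\beta_2)} \int_0^1 u^{\beta_2-1}(1-u)^{\beta_1-1}\,du = \frac{B(\beta_1,\beta_2)}{\alpha^{\beta_1+\beta_2}\,\Gamma(\beta_1)\Gamma(\beta_2)}\; x^{\beta_1+\beta_2-1}\, e^{-x/\alpha}, \tag{4.32}$$

where B denotes the beta function defined by eq. (1.58); recall that $B(\beta_2,\beta_1) = B(\beta_1,\beta_2)$. Equation (1.59) yields

$$\frac{B(\beta_1,\beta_2)}{\Gamma(\beta_1)\Gamma(\beta_2)} = \frac{1}{\Gamma(\beta_1+\beta_2)}, \tag{4.33}$$

hence, if $x > 0$, then by eqs. (4.32) and (4.33) it follows that $r(x) = p_{\alpha,\beta_1+\beta_2}(x)$. This
completes the proof. □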


Recall that the Erlang distribution is defined as $E_{\lambda,n} = \Gamma_{\lambda^{-1},n}$. Thus, Proposition 4.6.11
implies the following corollary.


Corollary 4.6.12. Let X and Y be independent and distributed according to $E_{\lambda,n}$ and
$E_{\lambda,m}$, respectively. Then their sum X + Y is $E_{\lambda,n+m}$-distributed.

Another corollary of Proposition 4.5.14 (or of Corollary 4.6.12) describes the sum of
independent exponentially distributed random variables.


Corollary 4.6.13. Let $X_1, \ldots, X_n$ be independent $E_\lambda$-distributed. Then their sum
$X_1 + \cdots + X_n$ is Erlang distributed with parameters $\lambda$ and n.


Proof: Recall that $E_\lambda = E_{\lambda,1}$. By Corollary 4.6.12, $X_1 + X_2$ is $E_{\lambda,2}$-distributed. Proceeding
in this way, each time applying Corollary 4.6.12, leads to the desired result. □


Example 4.6.14. The lifetime of light bulbs is assumed to be $E_\lambda$-distributed for a certain
$\lambda > 0$. At time zero we switch on a first bulb. At the moment it burns out, we
replace it by a second one of the same type. If the second burns out, we replace it by a
third one, and so on. Let $S_n$ be the moment when the nth light bulb burns out. Which
distribution does $S_n$ possess?

Answer: Let $X_1, X_2, \ldots$ be the lifetimes of the first, second, and so on light bulb.
Then $S_n = X_1 + \cdots + X_n$. Since the light bulbs are assumed to be of the same type,
the random variables $X_j$ are all $E_\lambda$-distributed. Furthermore, the different lifetimes do
not influence each other, thus, we may assume that the $X_j$'s are independent. Now
Corollary 4.6.13 lets us conclude that $S_n$ is Erlang distributed with parameters $\lambda$ and n,
hence, if $t > 0$, by Proposition 1.6.25 we get

$$P\{S_n \le t\} = \frac{\lambda^n}{(n-1)!} \int_0^t x^{n-1} e^{-\lambda x}\,dx = 1 - \sum_{j=0}^{n-1} \frac{(\lambda t)^j}{j!}\, e^{-\lambda t}. \tag{4.34}$$
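
Both sides of eq. (4.34) can be compared numerically. The following sketch evaluates the integral with a midpoint rule (a choice made here for simplicity, not taken from the text) and the finite sum directly; the function names are illustrative.

```python
import math

def erlang_cdf_closed(n, lam, t):
    """Right-hand side of eq. (4.34): 1 - sum_{j<n} (lam*t)^j / j! * exp(-lam*t)."""
    partial = sum((lam * t) ** j / math.factorial(j) for j in range(n))
    return 1 - partial * math.exp(-lam * t)

def erlang_cdf_numeric(n, lam, t, steps=20_000):
    """Left-hand side of eq. (4.34), approximated by the midpoint rule on [0, t]."""
    h = t / steps
    total = 0.0
    for i in range(steps):
        x = (i + 0.5) * h
        total += x ** (n - 1) * math.exp(-lam * x)
    return lam ** n / math.factorial(n - 1) * total * h

print(erlang_cdf_closed(3, 2.0, 1.5))
print(erlang_cdf_numeric(3, 2.0, 1.5))
```

For n = 1 both expressions reduce to the distribution function of the exponential distribution.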


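Example 4.6.15. We continue the preceding example, but now ask a different
question. How often do we have to change light bulbs before some given
time T > 0?

Answer: Let Y be the number of changes necessary until time T. Then for $n \ge 0$ the
event $\{Y = n\}$ occurs if and only if $S_n \le T$, but $S_{n+1} > T$ (with the convention $S_0 := 0$). Here we use the notation of
Example 4.6.14. In other words,

$$P\{Y = n\} = P\{S_n \le T,\ S_{n+1} > T\} = P\big(\{S_n \le T\} \setminus \{S_{n+1} \le T\}\big), \qquad n = 0, 1, \ldots$$

Since $\{S_{n+1} \le T\} \subseteq \{S_n \le T\}$, it follows by eq. (4.34) that

$$P\{Y = n\} = P\{S_n \le T\} - P\{S_{n+1} \le T\} = \Big[1 - \sum_{j=0}^{n-1} \frac{(\lambda T)^j}{j!}\, e^{-\lambda T}\Big] - \Big[1 - \sum_{j=0}^{n} \frac{(\lambda T)^j}{j!}\, e^{-\lambda T}\Big] = \frac{(\lambda T)^n}{n!}\, e^{-\lambda T} = \mathrm{Pois}_{\lambda T}(\{n\}).$$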
Let us also mention an important equivalent random “experiment”: customers arrive
at the post office randomly. We assume that the times between their arrivals are independent
and $E_\lambda$-distributed. Then $S_n$ is the time when the nth customer arrives. Hence,
under these assumptions, the number of arriving customers until a certain time T > 0
is Poisson distributed with parameter $\lambda T$.
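
The relation between exponential waiting times and Poisson counts can be illustrated by simulation. This is a sketch under the assumptions of Examples 4.6.14 and 4.6.15; `random.expovariate` takes the rate $\lambda$ as its argument, and the seed and number of trials are arbitrary.

```python
import math
import random

def count_until(lam, T, rng):
    """Number of arrivals up to time T when the gaps are independent E_lambda."""
    t, n = 0.0, 0
    while True:
        t += rng.expovariate(lam)   # next waiting time, exponential with rate lam
        if t > T:
            return n                # S_n <= T but S_{n+1} > T
        n += 1

def empirical_pmf(lam, T, trials=100_000, seed=2):
    """Relative frequencies of the observed counts."""
    rng = random.Random(seed)
    counts = {}
    for _ in range(trials):
        y = count_until(lam, T, rng)
        counts[y] = counts.get(y, 0) + 1
    return {k: v / trials for k, v in counts.items()}

def poisson_pmf(mu, n):
    """Pois_mu({n})."""
    return mu ** n / math.factorial(n) * math.exp(-mu)

pmf = empirical_pmf(1.5, 2.0)
for n in range(6):
    print(n, round(pmf.get(n, 0.0), 4), round(poisson_pmf(3.0, n), 4))
```

With $\lambda = 1.5$ and $T = 2$ the observed frequencies should be close to those of the Poisson distribution with parameter $\lambda T = 3$.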


We now investigate the sum of two independent chi-squared distributed random
variables. Recall Definition 1.6.26: A random variable X is $\chi^2_n$-distributed if it is
$\Gamma_{2,n/2}$-distributed. Hence, Proposition 4.6.11 implies the following result.


Proposition 4.6.16. Suppose that X is $\chi^2_n$-distributed and that Y is $\chi^2_m$-distributed for
some $n, m \ge 1$. If X and Y are independent, then X + Y is $\chi^2_{n+m}$-distributed.


Proof: Because of Proposition 4.6.11, the sum X + Y is $\Gamma_{2,\frac{n}{2}+\frac{m}{2}} = \chi^2_{n+m}$-distributed. This
proves the assertion. □


Proposition 4.6.16 has the following important consequence.


Proposition 4.6.17. Let $X_1, \ldots, X_n$ be a sequence of independent N(0, 1)-distributed
random variables. Then $X_1^2 + \cdots + X_n^2$ is $\chi^2_n$-distributed.


Proof: Proposition 4.1.5 asserts that the random variables $X_j^2$ are $\chi^2_1$-distributed. Furthermore,
because of Proposition 4.1.9 they are also independent. Thus a successive
application of Proposition 4.6.16 proves the assertion. □
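
For n = 2 the statement of Proposition 4.6.17 is easy to test, because $\chi^2_2 = \Gamma_{2,1}$ has the density $\frac{1}{2} e^{-x/2}$ and hence the distribution function $1 - e^{-t/2}$ for $t > 0$. The sketch below compares this with a simulation; the seed and sample size are arbitrary.

```python
import math
import random

def chi2_two_cdf(t):
    """Distribution function of chi^2_2 = Gamma_{2,1}: 1 - exp(-t/2) for t > 0."""
    return 1 - math.exp(-t / 2) if t > 0 else 0.0

def empirical_cdf(t, trials=200_000, seed=3):
    """Relative frequency of X_1^2 + X_2^2 <= t for independent standard normals."""
    rng = random.Random(seed)
    hits = sum(rng.gauss(0, 1) ** 2 + rng.gauss(0, 1) ** 2 <= t
               for _ in range(trials))
    return hits / trials

for t in (0.5, 2.0, 5.0):
    print(t, round(empirical_cdf(t), 4), round(chi2_two_cdf(t), 4))
```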


Our next and final aim in this section is to investigate the distribution of the sum of two
independent normally distributed random variables. Here the following important
result is valid.


Proposition 4.6.18. Let $X_1$ and $X_2$ be two independent random variables distributed
according to $N(\mu_1, \sigma_1^2)$ and $N(\mu_2, \sigma_2^2)$. Then $X_1 + X_2$ is $N(\mu_1 + \mu_2, \sigma_1^2 + \sigma_2^2)$-distributed.


Proof: In a first step we treat a special case, namely $\mu_1 = \mu_2 = 0$ and $\sigma_1 = 1$. To simplify
the notation, set $\lambda = \sigma_2$. Thus we have to prove the following: if $X_1$ and $X_2$ are independent and $N(0, 1)$-
and $N(0, \lambda^2)$-distributed, then $X_1 + X_2$ is $N(0, 1 + \lambda^2)$-distributed.

Let $p_{0,1}$ and $p_{0,\lambda^2}$ be the corresponding densities introduced in eq. (1.47). Then we
have to prove that

$$p_{0,1} * p_{0,\lambda^2} = p_{0,1+\lambda^2}. \tag{4.35}$$
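To verify this start with

$$(p_{0,1} * p_{0,\lambda^2})(x) = \frac{1}{2\pi\lambda} \int_{-\infty}^{\infty} e^{-(x-y)^2/2}\, e^{-y^2/2\lambda^2}\,dy = \frac{1}{2\pi\lambda} \int_{-\infty}^{\infty} e^{-\frac{1}{2}\left(x^2 - 2xy + (1+\lambda^{-2})y^2\right)}\,dy. \tag{4.36}$$

We use

$$x^2 - 2xy + (1+\lambda^{-2})y^2 = \Big((1+\lambda^{-2})^{1/2}\, y - (1+\lambda^{-2})^{-1/2}\, x\Big)^2 + x^2\Big(1 - \frac{1}{1+\lambda^{-2}}\Big) = \Big(\alpha y - \frac{x}{\alpha}\Big)^2 + \frac{x^2}{1+\lambda^2}$$

with $\alpha := (1+\lambda^{-2})^{1/2}$. Plugging this transformation into eq. (4.36) leads to

$$(p_{0,1} * p_{0,\lambda^2})(x) = \frac{e^{-x^2/2(1+\lambda^2)}}{2\pi\lambda} \int_{-\infty}^{\infty} e^{-(\alpha y - x/\alpha)^2/2}\,dy. \tag{4.37}$$

Next change the variables by $u := \alpha y - x/\alpha$, thus, $dy = du/\alpha$, and observe that $\alpha\lambda = (1+\lambda^2)^{1/2}$. Then the right-hand side of eq. (4.37) transforms to

$$\frac{e^{-x^2/2(1+\lambda^2)}}{2\pi\,(1+\lambda^2)^{1/2}} \int_{-\infty}^{\infty} e^{-u^2/2}\,du = p_{0,1+\lambda^2}(x).$$

Here we used Proposition 1.6.6, which asserts $\int_{-\infty}^{\infty} e^{-u^2/2}\,du = \sqrt{2\pi}$. This proves the
validity of eq. (4.35).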
In a second step we treat the general case, that is, $X_1$ is $N(\mu_1, \sigma_1^2)$- and $X_2$ is
$N(\mu_2, \sigma_2^2)$-distributed. Set

$$Y_1 := \frac{X_1 - \mu_1}{\sigma_1} \quad\text{and}\quad Y_2 := \frac{X_2 - \mu_2}{\sigma_2}.$$

By Proposition 4.2.3, the random variables $Y_1$ and $Y_2$ are standard normal and,
moreover, because of Proposition 4.1.9, also independent. Thus, the sum $X_1 + X_2$ may
be represented as

$$X_1 + X_2 = \mu_1 + \mu_2 + \sigma_1 Y_1 + \sigma_2 Y_2 = \mu_1 + \mu_2 + \sigma_1 Z,$$

where $Z = Y_1 + \lambda Y_2$ with $\lambda = \sigma_2/\sigma_1$.
An application of the first step shows that Z is $N(0, 1 + \lambda^2)$-distributed. Hence,
Proposition 4.2.3 implies the existence of a standard normally distributed $Z_0$ such that
$Z = (1+\lambda^2)^{1/2} Z_0$. Summing up, $X_1 + X_2$ may now be written as

$$X_1 + X_2 = \mu_1 + \mu_2 + \sigma_1 (1+\lambda^2)^{1/2} Z_0 = \mu_1 + \mu_2 + (\sigma_1^2 + \sigma_2^2)^{1/2} Z_0,$$

and another application of Proposition 4.2.3 lets us conclude that, as asserted, the
sum $X_1 + X_2$ is $N(\mu_1 + \mu_2, \sigma_1^2 + \sigma_2^2)$-distributed. □
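
Identity (4.35), the core of the first step, can also be checked numerically. The sketch below approximates the convolution integral by a midpoint rule on a truncated interval; the truncation bounds and the step count are ad hoc choices made for this illustration.

```python
import math

def normal_pdf(x, var):
    """Density p_{0,var} of the N(0, var) distribution."""
    return math.exp(-x * x / (2 * var)) / math.sqrt(2 * math.pi * var)

def convolution(x, lam, lo=-12.0, hi=12.0, steps=24_000):
    """(p_{0,1} * p_{0,lam^2})(x), approximated by the midpoint rule on [lo, hi]."""
    h = (hi - lo) / steps
    total = 0.0
    for i in range(steps):
        y = lo + (i + 0.5) * h
        total += normal_pdf(x - y, 1.0) * normal_pdf(y, lam * lam)
    return total * h

lam = 1.5
for x in (0.0, 0.7, -2.0):
    print(x, convolution(x, lam), normal_pdf(x, 1 + lam * lam))
```

Both columns should agree up to the numerical integration error.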


4.7 Products and Quotients of Random Variables 


Let X and Y be two random variables mapping a sample space $\Omega$ into $\mathbb{R}$. Then their
product $X \cdot Y$ and their quotient $X/Y$ (assume $Y(\omega) \ne 0$ for $\omega \in \Omega$) are defined by

$$(X \cdot Y)(\omega) := X(\omega) \cdot Y(\omega) \quad\text{and}\quad \Big(\frac{X}{Y}\Big)(\omega) := \frac{X(\omega)}{Y(\omega)}, \qquad \omega \in \Omega.$$

The aim of this section is to investigate the distribution of such products and quotients.
We restrict ourselves to continuous X and Y because, later on, we will only
deal with products and quotients of those random variables. Furthermore, we omit the
proof of the fact that products and quotients are random variables as well. The proofs
of these permanence properties are not complicated and follow the ideas used in the
proof of Proposition 4.5.1. Thus, we are interested in products $X \cdot Y$ and quotients $X/Y$ for
independent X and Y, where, to simplify the computations, we suppose $P\{Y > 0\} = 1$.
We start with the investigation of products of continuous random variables. Thus,
let X and Y be two random variables with distribution densities p and q. Since we
assumed $P\{Y > 0\} = 1$, we may choose the density q such that $q(x) = 0$ if $x \le 0$.


Proposition 4.7.1. Let X and Y be two independent random variables possessing the
stated properties. Then $X \cdot Y$ is continuous as well, and its density r may be calculated by

$$r(x) = \int_0^{\infty} p\Big(\frac{x}{y}\Big)\, \frac{q(y)}{y}\,dy, \qquad x \in \mathbb{R}. \tag{4.38}$$


Proof: For $t \in \mathbb{R}$ we evaluate $P\{X \cdot Y \le t\}$. To this end fix $t \in \mathbb{R}$ and set

$$A_t := \{(u, y) \in \mathbb{R} \times (0, \infty) : u \cdot y \le t\}.$$

As in the proof of Proposition 4.5.14, it follows that

$$P\{X \cdot Y \le t\} = P_{(X,Y)}(A_t) = \int_0^{\infty} \Big[\int_{-\infty}^{t/y} p(u)\,du\Big]\, q(y)\,dy. \tag{4.39}$$

In the inner integral of eq. (4.39) we change the variables by $x := uy$, hence $dx = y\,du$.
Notice that in the inner integral y is a constant. After this change of variables the
right-hand integral in eq. (4.39) becomes

$$\int_0^{\infty} \Big[\int_{-\infty}^{t} p\Big(\frac{x}{y}\Big)\, \frac{dx}{y}\Big]\, q(y)\,dy = \int_{-\infty}^{t} \Big[\int_0^{\infty} p\Big(\frac{x}{y}\Big)\, \frac{q(y)}{y}\,dy\Big]\,dx = \int_{-\infty}^{t} r(x)\,dx.$$

This being valid for all $t \in \mathbb{R}$, the function r is a density of $X \cdot Y$. □


Example 4.7.2. Let U and V be two independent random variables uniformly distributed
on [0, 1]. Which probability distribution does $U \cdot V$ possess?

Answer: We have $p(y) = q(y) = 1$ if $0 \le y \le 1$, and $p(y) = q(y) = 0$ otherwise.
Furthermore, $0 \le U \cdot V \le 1$, hence its density r satisfies $r(x) = 0$ if $x \notin [0, 1]$. For $x \in (0, 1]$
we apply formula (4.38) and obtain

$$r(x) = \int_0^{\infty} p\Big(\frac{x}{y}\Big)\, \frac{q(y)}{y}\,dy = \int_x^1 \frac{1}{y}\,dy = -\ln(x) = \ln\Big(\frac{1}{x}\Big), \qquad 0 < x \le 1.$$

Consequently, if $0 < a < b \le 1$, then

$$P\{a \le U \cdot V \le b\} = -\int_a^b \ln(x)\,dx = -\big[x \ln x - x\big]_a^b = a\ln(a) - b\ln(b) + b - a.$$

In particular, it follows that

$$P\{U \cdot V \le t\} = t - t\ln t, \qquad 0 < t \le 1. \tag{4.40}$$
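
Formula (4.40) is easy to test by simulation. The following sketch uses arbitrary values for the seed and the number of trials.

```python
import math
import random

def product_cdf(t):
    """P{U * V <= t} according to eq. (4.40), extended by 0 and 1 outside (0, 1)."""
    if t <= 0:
        return 0.0
    if t >= 1:
        return 1.0
    return t - t * math.log(t)

def empirical_cdf(t, trials=200_000, seed=4):
    """Relative frequency of U * V <= t for independent uniform U and V."""
    rng = random.Random(seed)
    hits = sum(rng.random() * rng.random() <= t for _ in range(trials))
    return hits / trials

for t in (0.1, 0.25, 0.8):
    print(t, round(empirical_cdf(t), 4), round(product_cdf(t), 4))
```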


Our next objective is the quotient of two random variables X and Y. We denote their
densities by p and q, thereby assuming $q(x) = 0$ if $x \le 0$. Then we get


Proposition 4.7.3. Let X and Y be independent with $P\{Y > 0\} = 1$. Then their quotient
$X/Y$ has the density r given by

$$r(x) = \int_0^{\infty} y\, p(xy)\, q(y)\,dy, \qquad x \in \mathbb{R}.$$


Proof: The proof of Proposition 4.7.3 is quite similar to that of Proposition 4.7.1. Therefore, we
present only the main steps. Setting

$$A_t := \{(u, y) \in \mathbb{R} \times (0, \infty) : u \le ty\},$$

we obtain

$$P\{(X/Y) \le t\} = P_{(X,Y)}(A_t) = \int_0^{\infty} \Big[\int_{-\infty}^{ty} p(u)\,du\Big]\, q(y)\,dy. \tag{4.41}$$

We change the variables in the inner integral of eq. (4.41) by putting $x = u/y$. After that
we interchange the integrals (as in the proof of Proposition 4.7.1, this interchange is justified by Proposition A.5.5, because p and q are nonnegative) and arrive at

$$P\{(X/Y) \le t\} = \int_{-\infty}^{t} \Big[\int_0^{\infty} y\, p(xy)\, q(y)\,dy\Big]\,dx = \int_{-\infty}^{t} r(x)\,dx$$

for all $t \in \mathbb{R}$. This proves that r is a density of $X/Y$. □


Example 4.7.4. Let U and V be as in Example 4.7.2. We now investigate their quotient
$U/V$. By Proposition 4.7.3 its density r can be computed by

$$r(x) = \int_0^{\infty} y\, p(xy)\, q(y)\,dy = \int_0^1 y\, p(xy)\,dy = \int_0^1 y\,dy = \frac{1}{2}$$

in the case $0 \le x \le 1$. If $1 < x < \infty$, then $p(xy) = 0$ if $y > 1/x$, and it follows that

$$r(x) = \int_0^{1/x} y\,dy = \frac{1}{2x^2}$$

for those x. Combining both cases, the density r of $U/V$ may be written as

$$r(x) = \begin{cases} \frac{1}{2} & : 0 \le x \le 1 \\ \frac{1}{2x^2} & : 1 < x < \infty \\ 0 & : \text{otherwise.} \end{cases}$$

Question: Does there exist an easy geometric explanation for $r(x) = \frac{1}{2}$ in the case
$0 \le x \le 1$?
Answer: If $t > 0$, then

$$F_{U/V}(t) = P\{U/V \le t\} = P\{U \le tV\} = P_{(U,V)}(A_t),$$

where

$$A_t := \{(u, v) \in [0, 1]^2 : 0 \le u \le vt\}.$$

If $0 < t \le 1$, then $A_t$ is a triangle in $[0, 1]^2$ with area $\mathrm{vol}_2(A_t) = \frac{t}{2}$. The independence of
U and V implies (cf. Example 3.6.21) that $P_{(U,V)}$ is the uniform distribution on $[0, 1]^2$,
hence

$$F_{U/V}(t) = P_{(U,V)}(A_t) = \mathrm{vol}_2(A_t) = \frac{t}{2}, \qquad 0 < t \le 1,$$

leading to $r(t) = F'_{U/V}(t) = \frac{1}{2}$ for those t.
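
Integrating the density of $U/V$ gives the distribution function $t/2$ for $0 \le t \le 1$ and $1 - \frac{1}{2t}$ for $t > 1$. The sketch below compares this with a simulation of the event $\{U \le tV\}$, which avoids dividing by a simulated value; the seed and sample size are arbitrary.

```python
import random

def quotient_cdf(t):
    """Distribution function of U/V obtained by integrating the density r."""
    if t <= 0:
        return 0.0
    if t <= 1:
        return t / 2
    return 1 - 1 / (2 * t)

def empirical_cdf(t, trials=200_000, seed=5):
    """Relative frequency of the event {U <= t * V}."""
    rng = random.Random(seed)
    hits = sum(rng.random() <= t * rng.random() for _ in range(trials))
    return hits / trials

for t in (0.5, 1.0, 2.0, 4.0):
    print(t, round(empirical_cdf(t), 4), quotient_cdf(t))
```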
4.7.1 Student’s t-Distribution


Let us use Proposition 4.7.3 to compute the density of a distribution, which plays a
crucial role in Mathematical Statistics.


Proposition 4.7.5. Let X be N(0, 1)-distributed and Y be independent of X and $\chi^2_n$-distributed
for some $n \ge 1$. Define the random variable Z as

$$Z := \frac{X}{\sqrt{Y/n}}.$$

Then Z possesses the density r given by

$$r(x) = \frac{\Gamma\big(\frac{n+1}{2}\big)}{\sqrt{n\pi}\;\Gamma\big(\frac{n}{2}\big)} \Big(1 + \frac{x^2}{n}\Big)^{-n/2-1/2}, \qquad x \in \mathbb{R}. \tag{4.42}$$


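Proof: In a first step we determine the density of $\sqrt{Y}$ with Y distributed according to
$\chi^2_n$. If $t > 0$, then

$$F_{\sqrt{Y}}(t) = P\{\sqrt{Y} \le t\} = P\{Y \le t^2\} = \frac{1}{2^{n/2}\,\Gamma\big(\frac{n}{2}\big)} \int_0^{t^2} x^{n/2-1} e^{-x/2}\,dx.$$

Thus, if $t > 0$, then the density q of $\sqrt{Y}$ equals

$$q(t) = \frac{d}{dt} F_{\sqrt{Y}}(t) = \frac{1}{2^{n/2}\,\Gamma\big(\frac{n}{2}\big)}\,(2t)\, t^{n-2} e^{-t^2/2} = \frac{1}{2^{n/2-1}\,\Gamma\big(\frac{n}{2}\big)}\, t^{n-1} e^{-t^2/2}. \tag{4.43}$$

Of course, we have $q(t) = 0$ if $t \le 0$.
In a second step we determine the density $\tilde{r}$ of $\tilde{Z} = Z/\sqrt{n} = X/\sqrt{Y}$. An application
of Proposition 4.7.3 for $p(x) = \frac{1}{\sqrt{2\pi}}\, e^{-x^2/2}$ and q given in eq. (4.43) leads to

$$\tilde{r}(x) = \int_0^{\infty} y \Big[\frac{1}{\sqrt{2\pi}}\, e^{-(xy)^2/2}\Big] \Big[\frac{1}{2^{n/2-1}\,\Gamma\big(\frac{n}{2}\big)}\, y^{n-1} e^{-y^2/2}\Big]\,dy = \frac{1}{\sqrt{2\pi}\; 2^{n/2-1}\,\Gamma\big(\frac{n}{2}\big)} \int_0^{\infty} y^n e^{-(1+x^2)y^2/2}\,dy. \tag{4.44}$$

Change the variables in the last integral by $v = \frac{y^2}{2}(1 + x^2)$. Then $y = \frac{\sqrt{2v}}{(1+x^2)^{1/2}}$ and,
consequently, $dy = \frac{1}{\sqrt{2}}\, v^{-1/2} (1+x^2)^{-1/2}\,dv$. Inserting this into eq. (4.44) shows that

$$\tilde{r}(x) = \frac{1}{\sqrt{\pi}\; 2^{n/2}\,\Gamma\big(\frac{n}{2}\big)} \int_0^{\infty} \frac{2^{n/2}\, v^{n/2-1/2}\, e^{-v}}{(1+x^2)^{n/2+1/2}}\,dv = \frac{\Gamma\big(\frac{n+1}{2}\big)}{\sqrt{\pi}\;\Gamma\big(\frac{n}{2}\big)}\, (1+x^2)^{-n/2-1/2}. \tag{4.45}$$

In a third step, we finally obtain the density r of Z. Since $Z = \sqrt{n}\,\tilde{Z}$, formula (4.3)
applies with $b = 0$ and $a = \sqrt{n}$. Thus, by eq. (4.45) for $\tilde{r}$, as asserted,

$$r(x) = \frac{1}{\sqrt{n}}\;\tilde{r}\Big(\frac{x}{\sqrt{n}}\Big) = \frac{\Gamma\big(\frac{n+1}{2}\big)}{\sqrt{n\pi}\;\Gamma\big(\frac{n}{2}\big)} \Big(1 + \frac{x^2}{n}\Big)^{-n/2-1/2}. \qquad □$$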


Definition 4.7.6. The probability measure on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$ with density r, given by
eq. (4.42), is called $t_n$-distribution or Student’s t-distribution with n degrees of
freedom. A random variable Z is said to be $t_n$-distributed or t-distributed with n
degrees of freedom, provided its probability distribution is a $t_n$-distribution, that
is, for $a < b$

$$P\{a \le Z \le b\} = \frac{\Gamma\big(\frac{n+1}{2}\big)}{\sqrt{n\pi}\;\Gamma\big(\frac{n}{2}\big)} \int_a^b \Big(1 + \frac{x^2}{n}\Big)^{-n/2-1/2} dx.$$


Remark 4.7.7. The $t_1$-distribution coincides with the Cauchy distribution introduced
in 1.6.33. Observe that $\Gamma(1/2) = \sqrt{\pi}$ and $\Gamma(1) = 1$.
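
The density (4.42) is straightforward to evaluate with the gamma function from the standard library. The sketch below checks Remark 4.7.7 pointwise, assuming the usual Cauchy density $\frac{1}{\pi(1+x^2)}$, and confirms numerically that the density for n = 5 integrates to approximately 1; the integration bounds are an ad hoc truncation.

```python
import math

def t_density(n, x):
    """Density of the t_n-distribution from eq. (4.42)."""
    c = math.gamma((n + 1) / 2) / (math.sqrt(n * math.pi) * math.gamma(n / 2))
    return c * (1 + x * x / n) ** (-(n + 1) / 2)

def cauchy_density(x):
    """Standard Cauchy density 1 / (pi * (1 + x^2))."""
    return 1 / (math.pi * (1 + x * x))

def integrate(f, lo, hi, steps=200_000):
    """Midpoint rule for the integral of f over [lo, hi]."""
    h = (hi - lo) / steps
    return sum(f(lo + (i + 0.5) * h) for i in range(steps)) * h

print(t_density(1, 0.5), cauchy_density(0.5))
print(integrate(lambda x: t_density(5, x), -50.0, 50.0))
```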


In view of Definition 4.7.6, we may now formulate Proposition 4.7.5 as follows.


Proposition 4.7.8. If X and Y are independent and N(0, 1)- and $\chi^2_n$-distributed, then
$\frac{X}{\sqrt{Y/n}}$ is $t_n$-distributed.


Proposition 4.6.17 leads to yet another version of Proposition 4.7.5:


Proposition 4.7.9. If $X, X_1, \ldots, X_n$ are independent N(0, 1)-distributed, then

$$\frac{X}{\sqrt{\frac{1}{n}\sum_{i=1}^{n} X_i^2}}$$

is $t_n$-distributed.

Corollary 4.7.10. If X and Y are independent and N(0, 1)-distributed, then $X/|Y|$ possesses
a Cauchy distribution.


Proof: An application of Proposition 4.7.9 with n = 1 and $X_1 = Y$ implies that $X/|Y|$
is $t_1$-distributed. We saw in Remark 4.7.7 that the $t_1$-distribution and the Cauchy distribution coincide,
thus, $X/|Y|$ is also Cauchy distributed. □


4.7.2 F-Distribution 


We present now another important class of probability measures or probability 
distributions playing a central role in Mathematical Statistics. 


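Proposition 4.7.11. For two natural numbers m and n let X and Y be independent and
$\chi^2_m$- and $\chi^2_n$-distributed. Then $Z := \frac{X/m}{Y/n}$ has the distribution density r defined as

$$r(x) = \begin{cases} 0 & : x \le 0 \\[4pt] m^{m/2}\, n^{n/2}\, \dfrac{\Gamma\big(\frac{m+n}{2}\big)}{\Gamma\big(\frac{m}{2}\big)\Gamma\big(\frac{n}{2}\big)} \cdot \dfrac{x^{m/2-1}}{(mx+n)^{(m+n)/2}} & : x > 0. \end{cases} \tag{4.46}$$


Proof: We first evaluate the density $\tilde{r}$ of $\tilde{Z} = X/Y$. To this end we apply Proposition
4.7.3 with functions p and q given by

$$p(x) = \frac{1}{2^{m/2}\,\Gamma(m/2)}\, x^{m/2-1} e^{-x/2} \quad\text{and}\quad q(y) = \frac{1}{2^{n/2}\,\Gamma(n/2)}\, y^{n/2-1} e^{-y/2}$$

whenever $x, y > 0$. Then we get

$$\tilde{r}(x) = \frac{1}{2^{(m+n)/2}\,\Gamma(m/2)\,\Gamma(n/2)} \int_0^{\infty} y\,(xy)^{m/2-1}\, y^{n/2-1}\, e^{-xy/2}\, e^{-y/2}\,dy = \frac{x^{m/2-1}}{2^{(m+n)/2}\,\Gamma(m/2)\,\Gamma(n/2)} \int_0^{\infty} y^{(m+n)/2-1}\, e^{-y(1+x)/2}\,dy. \tag{4.47}$$

We replace in eq. (4.47) the variable y by $u = y(1+x)/2$, thus, $dy = \frac{2}{1+x}\,du$. Inserting this
into eq. (4.47), the last expression transforms to

$$\frac{x^{m/2-1}}{\Gamma(m/2)\,\Gamma(n/2)}\,(1+x)^{-(n+m)/2} \int_0^{\infty} u^{(m+n)/2-1}\, e^{-u}\,du = \frac{\Gamma\big(\frac{m+n}{2}\big)}{\Gamma\big(\frac{m}{2}\big)\Gamma\big(\frac{n}{2}\big)} \cdot \frac{x^{m/2-1}}{(1+x)^{(m+n)/2}}.$$

Because of $Z = \frac{n}{m}\,\tilde{Z}$, we obtain the density r of Z by Proposition 1.7.17. Indeed, then

$$r(x) = \frac{m}{n}\;\tilde{r}\Big(\frac{mx}{n}\Big) = m^{m/2}\, n^{n/2}\, \frac{\Gamma\big(\frac{m+n}{2}\big)}{\Gamma\big(\frac{m}{2}\big)\Gamma\big(\frac{n}{2}\big)} \cdot \frac{x^{m/2-1}}{(mx+n)^{(m+n)/2}}$$

as asserted. □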


Remark 4.7.12. Using relation (1.59) between the beta and the gamma functions, the
density r of Z may also be written as

$$r(x) = \frac{m^{m/2}\, n^{n/2}}{B\big(\frac{m}{2}, \frac{n}{2}\big)} \cdot \frac{x^{m/2-1}}{(mx+n)^{(m+n)/2}}, \qquad x > 0.$$


Definition 4.7.13. The probability measure on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$ with density r defined by
eq. (4.46) is called Fisher-Snedecor distribution or F-distribution (with m and
n degrees of freedom).

A random variable Z is F-distributed (with m and n degrees of freedom),
provided its probability distribution is an F-distribution. Equivalently, if $0 \le a < b$,
then

$$P\{a \le Z \le b\} = m^{m/2}\, n^{n/2}\, \frac{\Gamma\big(\frac{m+n}{2}\big)}{\Gamma\big(\frac{m}{2}\big)\Gamma\big(\frac{n}{2}\big)} \int_a^b \frac{x^{m/2-1}}{(mx+n)^{(m+n)/2}}\,dx.$$

The random variable Z is also said to be $F_{m,n}$-distributed.


With this notation, Proposition 4.7.11 may now be formulated as follows:


Proposition 4.7.14. If two independent random variables X and Y are $\chi^2_m$- and $\chi^2_n$-distributed,
then $\frac{X/m}{Y/n}$ is $F_{m,n}$-distributed.


Finally, Proposition 4.6.17 implies the following version of the previous result.


Proposition 4.7.15. Let $X_1, \ldots, X_m, Y_1, \ldots, Y_n$ be independent N(0, 1)-distributed.
Then

$$\frac{\frac{1}{m}\sum_{i=1}^{m} X_i^2}{\frac{1}{n}\sum_{j=1}^{n} Y_j^2}$$

is $F_{m,n}$-distributed.


Corollary 4.7.16. If a random variable Z is $F_{m,n}$-distributed, then $1/Z$ possesses an $F_{n,m}$
distribution.


Proof: This is an immediate consequence of Proposition 4.7.11. □
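
Corollary 4.7.16 can also be checked on the level of densities. The sketch below assumes the standard transformation rule that, for a positive random variable Z with density r, the reciprocal $1/Z$ has the density $r(1/x)/x^2$ for $x > 0$; this rule is not derived in the text above, and the function names are illustrative.

```python
import math

def f_density(m, n, x):
    """Density of the F-distribution with m and n degrees of freedom, eq. (4.46)."""
    if x <= 0:
        return 0.0
    c = (m ** (m / 2) * n ** (n / 2) * math.gamma((m + n) / 2)
         / (math.gamma(m / 2) * math.gamma(n / 2)))
    return c * x ** (m / 2 - 1) / (m * x + n) ** ((m + n) / 2)

def reciprocal_density(m, n, x):
    """Density of 1/Z at x > 0 when Z has the density f_density(m, n, .)."""
    return f_density(m, n, 1 / x) / (x * x)

for x in (0.3, 1.0, 2.5):
    print(x, reciprocal_density(4, 7, x), f_density(7, 4, x))
```

The two columns agree, in line with the exchange of the roles of m and n asserted by the corollary.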
4.8 Problems


Problem 4.1. Let U be uniformly distributed on [0,1]. Which distributions do the 
following random variables possess 


1 
min{U,1-—U}, max{U,1—-U}, |2U -1| and lu- ;| ? 
Problem 4.2 (Generating functions). Let X be a random variable with values in $\mathbb{N}_0$.
For $k \in \mathbb{N}_0$ let $p_k = P\{X = k\}$. Then its generating function $\varphi_X$ is defined by

$$\varphi_X(t) = \sum_{k=0}^{\infty} p_k t^k.$$

1. Show that $\varphi_X(t)$ exists if $|t| \le 1$.
2. Let X and Y be two independent random variables with values in $\mathbb{N}_0$. Prove that
then

$$\varphi_{X+Y} = \varphi_X \cdot \varphi_Y.$$

3. Compute $\varphi_X$ in each of the following cases:
(a) X is uniformly distributed on {1, ..., N} for some $N \ge 1$.
(b) X is $B_{n,p}$-distributed for some $n \ge 1$ and $p \in [0, 1]$.
(c) X is $\mathrm{Pois}_\lambda$-distributed for some $\lambda > 0$.
(d) X is $G_p$-distributed for a certain $0 < p < 1$.
(e) X is $B^-_{n,p}$-distributed.


Problem 4.3. Roll two dice simultaneously. Let X be the result of the first die and Y that
of the second one. Is it possible to falsify these two dice in such a way that X + Y
is uniformly distributed on {2, ..., 12}? It is not assumed that both dice are falsified in
the same way.

Hint: One possible way to answer this question is as follows: investigate the gen- 
erating functions of X and Y and compare their product with the generating function 
of the uniform distribution on {2, ..., 12}. 


Problem 4.4. Let $X_1, \ldots, X_n$ be a sequence of independent identically distributed
random variables with common distribution function F and distribution density p,
that is,

$$P\{X_j \le t\} = F(t) = \int_{-\infty}^{t} p(x)\,dx, \qquad j = 1, \ldots, n.$$

Define random variables $X_*$ and $X^*$ by

$$X_* := \min\{X_1, \ldots, X_n\} \quad\text{and}\quad X^* := \max\{X_1, \ldots, X_n\}.$$

1. Determine the distribution functions and densities of $X_*$ and $X^*$.

2. Describe the distribution of the random variable $X_*$ in the case that the $X_j$'s are
exponentially distributed with parameter $\lambda > 0$.

3. Suppose now the $X_j$'s are uniformly distributed on [0, 1]. Describe the distribution
of $X_*$ and $X^*$ in this case.


Problem 4.5. Find a function f from (0, 1) to R such that 


1 
PFU) =k}= =, k= 1,2, 2.0% 


for U uniformly distributed on [0, 1]. 


Problem 4.6. Let U be uniformly distributed on [0, 1]. Find functions f and g such that
X = f(U) and Y = g(U) have the distribution densities p and q with


0 :x¢(,1] O : |x| >1 
p(x) := ae Wed and q(x):=)x+1:-1<x<0O. 
7 1x€ 1] 1-x: O<x<1 


Problem 4.7. Let X and Y be independent random variables with 


al 
P(X =k} =P{Y=k}= Ei | isl ee 
How is X + Y distributed? 


Problem 4.8. The number of customers visiting a shop per day is Poisson distributed
with parameter $\lambda > 0$. The probability that a single customer buys something equals p
for a given $0 \le p \le 1$. Let X be the number of customers per day buying some goods.
Determine the distribution of X.

Remark: We assume that the decision whether or not a single customer buys
something is independent of the number of daily visitors.

A different way to formulate the above question is as follows: let $X_0, X_1, \ldots$ be
independent random variables with $P\{X_0 = 0\} = 1$,

$$P\{X_j = 1\} = p \quad\text{and}\quad P\{X_j = 0\} = 1 - p, \qquad j = 1, 2, \ldots,$$

for a certain $p \in [0, 1]$. Furthermore, let Y be a Poisson-distributed random variable
with parameter $\lambda > 0$, independent of the $X_j$. Determine the distribution of

$$X := \sum_{j=0}^{Y} X_j.$$

Hint: Use the “infinite” version of the law of total probability as stated in Problem 2.4.


Problem 4.9. Suppose X and Y are independent and exponentially distributed with 
parameter A > 0. Find the distribution densities of X - Y and X/Y. 


Problem 4.10. Two random variables U and V are independent and uniformly distributed
on [0, 1]. Given $n \in \mathbb{N}$, find the distribution density of U + nV.


Problem 4.11. Let X and Y be independent random variables distributed according to
$\mathrm{Pois}_\lambda$ and $\mathrm{Pois}_\mu$, respectively. Given $n \in \mathbb{N}_0$ and some $k \in \{0, \ldots, n\}$, prove

$$P\{X = k \mid X + Y = n\} = \binom{n}{k} \Big(\frac{\lambda}{\lambda+\mu}\Big)^{k} \Big(\frac{\mu}{\lambda+\mu}\Big)^{n-k} = B_{n,p}(\{k\}) \qquad\text{with } p = \frac{\lambda}{\lambda+\mu}.$$


Reformulation of the preceding problem: An owner of two stores, say store A and
store B, observes that the numbers of customers in these stores are independent
and $\mathrm{Pois}_\lambda$ and $\mathrm{Pois}_\mu$ distributed. One day he was told that there were n customers in
both stores together. What is the probability that k of the n customers were in store A,
hence n - k in store B?


Problem 4.12. Let X and Y be independent standard normal variables. Show that X/Y 
is Cauchy distributed. 

Hint: Use Corollary 4.7.10 and the fact that the vectors (X, Y), (-X, Y), (X, -Y), and 
(-X, -Y) are identically distributed. Note that the probability distribution of each of 
these two-dimensional vectors is the (two-dimensional) standard normal distribution. 


Problem 4.13. Let X and Y be independent $G_p$-distributed. Find the probability distribution
of X - Y.

Hint: Compare Example 4.5.5. There we evaluated the distribution of X - Y if $p = \frac{1}{2}$.

Problem 4.14. Let U and V be as in Example 4.7.2. Find an analytic (or geometric)
explanation for

$$P\{U \cdot V \le t\} = t - t\ln t, \qquad 0 < t \le 1,$$

proved in (4.40).
Hint: Use that the vector (U, V) is uniformly distributed on $[0, 1]^2$.

Problem 4.15. Suppose X is a random variable with values in $(a, b) \subseteq \mathbb{R}$ and with density
p. Let $f : (a, b) \to \mathbb{R}$ be (strictly) monotone and differentiable. Give a formula
for q, the density of f(X).
Hint: Investigate the cases of decreasing and increasing functions f separately.
Use this formula to evaluate the density of e* and of e* for a N(0,1)- 


distributed X. 


5 Expected Value, Variance, and Covariance 


5.1 Expected Value 
5.1.1 Expected Value of Discrete Random Variables 


What is an expected value (also called mean value or expectation) of a random variable?
How is it defined? Which property of the random variable does it describe and
how can it be computed? Does every random variable possess an expected value?

To approach the solution of these questions, let us start with an example. 


Example 5.1.1. Suppose N students attend a certain exam. The number of possible
points is 100. Given j = 0, 1, ..., 100, let $n_j$ be the number of students who achieved j
points. Now choose randomly, according to the uniform distribution (a single student
is chosen with probability 1/N), one student. Name him or her $\omega$, and define $X(\omega)$ as
the number of points that the chosen student achieved. Then X is a random variable
with values in D = {0, 1, ..., 100}. How is X distributed? Since X has values in D, its
distribution is described by the probabilities

$$p_j = P\{X = j\} = \frac{n_j}{N}, \qquad j = 0, 1, \ldots, 100. \tag{5.1}$$

As expected value of X we take the average number A of points in this exam. How is A
evaluated? The easiest way to do this is

$$A = \frac{1}{N} \sum_{j=0}^{100} j \cdot n_j = \sum_{j=0}^{100} j \cdot \frac{n_j}{N} = \sum_{j=0}^{100} j \cdot p_j,$$

where the $p_j$'s are defined by eq. (5.1). If we write $\mathbb{E}X$ for the expected value (or mean
value) of X, and if we assume that this value coincides with A, then the preceding
equation says

$$\mathbb{E}X = \sum_{j=0}^{100} j\, p_j = \sum_{j=0}^{100} j\, P\{X = j\} = \sum_{j=0}^{100} x_j\, P\{X = x_j\},$$

where the $x_j = j$, j = 0, ..., 100, denote the possible values of X.


In view of this example, the following definition for the expected value of a discrete
random variable X looks feasible. Suppose X has values in $D = \{x_1, x_2, \ldots\}$, and let
$p_j = P\{X = x_j\}$, j = 1, 2, .... Then the expected value $\mathbb{E}X$ of X is given by

$$\mathbb{E}X = \sum_{j=1}^{\infty} x_j\, p_j = \sum_{j=1}^{\infty} x_j\, P\{X = x_j\}. \tag{5.2}$$
Unfortunately, the sum in eq. (5.2) does not always exist. In order to overcome this
difficulty, let us recall some basic facts about infinite series of real numbers.

A sequence $(\alpha_j)_{j \ge 1}$ of real numbers is called summable, provided its sequence of
partial sums $(s_n)_{n \ge 1}$ with $s_n = \sum_{j=1}^{n} \alpha_j$ converges in $\mathbb{R}$. Then one defines

$$\sum_{j=1}^{\infty} \alpha_j = \lim_{n \to \infty} s_n.$$

Even if the sequence of partial sums diverges, in some cases we may assign
a value to the infinite series. If either $\lim_{n\to\infty} s_n = -\infty$ or $\lim_{n\to\infty} s_n = \infty$, then we write
$\sum_{j=1}^{\infty} \alpha_j = -\infty$ or $\sum_{j=1}^{\infty} \alpha_j = \infty$, respectively. In particular, if $\alpha_j \ge 0$ for $j \in \mathbb{N}$, then the
sequence of partial sums is nondecreasing, which implies that only two different cases
may occur: Either $\sum_{j=1}^{\infty} \alpha_j < \infty$ (in this case the sequence is summable) or $\sum_{j=1}^{\infty} \alpha_j = \infty$.

Let $(\alpha_j)_{j \ge 1}$ be an arbitrary sequence of real numbers. If $\sum_{j=1}^{\infty} |\alpha_j| < \infty$, then it is
called absolutely summable. Note that each absolutely summable sequence is summable.
This is a direct consequence of Cauchy’s convergence criterion. The converse
implication is wrong, as can be seen by considering $((-1)^n/n)_{n \ge 1}$.

Now we are prepared to define the expected value of a non-negative random 
variable. 


Definition 5.1.2. Let X be a discrete random variable with values in $\{x_1, x_2, \ldots\}$
for some $x_j \ge 0$. Equivalently, the random variable X is discrete with $X \ge 0$. Then
the expected value of X is defined by

$$\mathbb{E}X := \sum_{j=1}^{\infty} x_j\, P\{X = x_j\}. \tag{5.3}$$


Remark 5.1.3. Since $x_j\, P\{X = x_j\} \ge 0$ for non-negative X, for those random variables
the sum in eq. (5.3) is always well-defined, but may be infinite. That is, each non-negative
discrete random variable X possesses an expected value $\mathbb{E}X \in [0, \infty]$.


Let us now turn to the case of arbitrary (not necessarily non-negative) random 
variables. The next example shows which problems may arise. 


Example 5.1.4. We consider the probability measure introduced in Example 1.3.6 and
choose a random variable X with values in $\mathbb{Z}$ distributed according to the probability
measure in this example. In other words,

$$P\{X = k\} = \frac{3}{\pi^2} \cdot \frac{1}{k^2}, \qquad k \in \mathbb{Z} \setminus \{0\}.$$

If we try to evaluate the expected value of X by formula (5.2), then this leads to the
undetermined expression

$$\frac{3}{\pi^2} \sum_{k \in \mathbb{Z} \setminus \{0\}} \frac{k}{k^2} = \frac{3}{\pi^2} \Big[\sum_{k=1}^{\infty} \frac{1}{k} - \sum_{k=1}^{\infty} \frac{1}{k}\Big] = \infty - \infty.$$


To exclude phenomena as in Example 5.1.4, we suppose that a random variable has
to meet the following condition.


Definition 5.1.5. Let X be discrete with values in $\{x_1, x_2, \ldots\} \subset \mathbb{R}$. Then the
expected value of X exists, provided that

$$\mathbb{E}|X| = \sum_{j=1}^{\infty} |x_j|\, P\{X = x_j\} < \infty. \tag{5.4}$$


We mentioned above that an absolutely summable sequence is summable. Hence,
under assumption (5.4), the sum in the subsequent definition is a well-defined real
number.


Definition 5.1.6. Let X be a discrete random variable satisfying $\mathbb{E}|X| < \infty$. Then its
expected value is defined as

$$\mathbb{E}X := \sum_{j=1}^{\infty} x_j\, P\{X = x_j\}.$$

As before, $x_1, x_2, \ldots$ are the possible values of X.


Example 5.1.7. We start with an easy example that demonstrates how to compute the
expected value in concrete cases. If the distribution of a random variable X is defined
as P{X = -1} = 1/6, P{X = 0} = 1/8, P{X = 1} = 3/8, and P{X = 2} = 1/3, then its expected
value equals

$$\mathbb{E}X = (-1) \cdot P\{X = -1\} + 0 \cdot P\{X = 0\} + 1 \cdot P\{X = 1\} + 2 \cdot P\{X = 2\} = -\frac{1}{6} + \frac{3}{8} + \frac{2}{3} = \frac{7}{8}.$$
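
The computation in Example 5.1.7 can be repeated with exact fractions. The helper `expected_value` below is an illustrative name; it simply evaluates the finite sum of the values weighted by their probabilities.

```python
from fractions import Fraction

def expected_value(dist):
    """Sum of x * P{X = x} over the finitely many values x of X."""
    if sum(dist.values()) != 1:
        raise ValueError("probabilities must sum to 1")
    return sum(x * p for x, p in dist.items())

dist = {-1: Fraction(1, 6), 0: Fraction(1, 8), 1: Fraction(3, 8), 2: Fraction(1, 3)}
print(expected_value(dist))  # 7/8
```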

Example 5.1.8. The next example shows that $\mathbb{E}X = \infty$ may occur even for quite natural
random variables. Thus, let us come back to the model presented in Example
1.4.39. There we developed a strategy for always winning one dollar in a series of
games. The basic idea was that, after losing a game, one doubles the amount in
the pool the next time. As in Example 1.4.39, let X be the amount of money needed; if one wins
for the first time in game k, then $X = 2^k - 1$. We obtained

$$P\{X = 2^k - 1\} = p(1-p)^{k-1}, \qquad k = 1, 2, \ldots$$

Recall that $0 < p < 1$ is the probability to win a single game. We ask for the expected
value of money needed to apply this strategy. It follows that

$$\mathbb{E}X = \sum_{k=1}^{\infty} (2^k - 1)\, P\{X = 2^k - 1\} = p \sum_{k=1}^{\infty} (2^k - 1)(1-p)^{k-1}. \tag{5.5}$$

If the game is fair, that is, if p = 1/2, then this leads to

$$\mathbb{E}X = \sum_{k=1}^{\infty} \frac{2^k - 1}{2^k} = \infty$$

because of $(2^k - 1)/2^k \to 1$ as $k \to \infty$. This yields $\mathbb{E}X = \infty$ for all $p \le 1/2$ (see footnote 1).

Let us sum up: if $p \le 1/2$ (which is the case in all provided games), the obtained
result tells us that the average amount of money needed to use this strategy
is infinite. The owners of gambling casinos know this strategy as well. Therefore,
they limit the possible amount of money in the pool. For example, if the largest
possible stake is N dollars, then the strategy breaks down as soon as one loses n
games for some n with $2^n > N$. And, as our calculations show, on average this always
happens.


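Remark 5.1.9. If p > 1/2, then the average amount of money needed is finite, and it
can be calculated by

$$\mathbb{E}X = p \sum_{k=1}^{\infty} (2^k - 1)(1-p)^{k-1} = 2p \sum_{k=0}^{\infty} (2 - 2p)^k - p \sum_{k=0}^{\infty} (1-p)^k = \frac{2p}{1 - (2 - 2p)} - \frac{p}{1 - (1-p)} = \frac{2p}{2p - 1} - 1 = \frac{1}{2p - 1}.$$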
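
The behaviour of the series in eq. (5.5) can be observed through its partial sums. In the sketch below, `partial_expectation` is an illustrative helper; for p = 3/4 the partial sums approach $1/(2p-1) = 2$, whereas for p = 1/2 they grow without bound.

```python
def partial_expectation(p, terms):
    """Partial sum of eq. (5.5): sum over k = 1..terms of (2^k - 1) p (1 - p)^(k-1)."""
    return sum((2 ** k - 1) * p * (1 - p) ** (k - 1) for k in range(1, terms + 1))

print(partial_expectation(0.75, 200))                               # close to 2
print(partial_expectation(0.5, 10), partial_expectation(0.5, 100))  # keeps growing
```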


5.1.2 Expected Value of Certain Discrete Random Variables 


The aim of this section is to compute the expected value of the most interesting 
discrete random variables. We start with uniformly distributed ones. 


Footnote 1: If $p \le 1/2$, then $1 - p \ge 1/2$, hence the sum in eq. (5.5) becomes bigger and, therefore, it also diverges.

Proposition 5.1.10. Let X be uniformly distributed on the set $\{x_1, \ldots, x_N\}$ of real
numbers. Then it follows that

$$\mathbb{E}X = \frac{1}{N} \sum_{j=1}^{N} x_j. \tag{5.6}$$

That is, $\mathbb{E}X$ is the arithmetic mean of the $x_j$'s.


Proof: This is an immediate consequence of $P\{X = x_j\} = 1/N$, implying

$$\mathbb{E}X = \sum_{j=1}^{N} x_j \cdot P\{X = x_j\} = \sum_{j=1}^{N} x_j \cdot \frac{1}{N}. \qquad □$$


Remark 5.1.11. For general discrete random variables X with values $x_1, x_2, \ldots$, their
expected value $\mathbb{E}X$ may be regarded as a weighted (the weights are the $p_j$'s) mean of
the $x_j$'s.


Example 5.1.12. Let X be uniformly distributed on {1, ..., 6}. Then X is a model for
rolling a fair die. Its expected value is, as is well known,

$$\mathbb{E}X = \frac{1 + \cdots + 6}{6} = \frac{21}{6} = \frac{7}{2}.$$


Next we determine the expected value of a binomial distributed random variable. 


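Proposition 5.1.13. Let X be binomially distributed with parameters n and p. Then we get

$$\mathbb{E}X = np. \tag{5.7}$$


Proof: The possible values of X are 0, ..., n. Thus, it follows that

$$\mathbb{E}X = \sum_{k=0}^{n} k \cdot P\{X = k\} = \sum_{k=1}^{n} k \binom{n}{k} p^k (1-p)^{n-k} = \sum_{k=1}^{n} \frac{n!}{(k-1)!\,(n-k)!}\, p^k (1-p)^{n-k} = np \sum_{k=1}^{n} \frac{(n-1)!}{(k-1)!\,(n-k)!}\, p^{k-1} (1-p)^{n-k}.$$

Shifting the index from k - 1 to k in the last sum implies

$$\mathbb{E}X = np \sum_{k=0}^{n-1} \binom{n-1}{k} p^k (1-p)^{n-1-k} = np\,\big[p + (1-p)\big]^{n-1} = np.$$

This completes the proof. □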


Remark 5.1.14. The previous result tells us the following: if we perform n independent 
trials of an experiment with success probability p, then on the average we will observe 
np times success. 
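As a quick numerical check of Proposition 5.1.13 and Remark 5.1.14, the expected value can be summed directly from the binomial probabilities. This is only a sketch; the function name `binomial_mean` is an illustrative choice.

```python
from math import comb


def binomial_mean(n, p):
    """Sum k * P{X = k} over k = 0..n for a binomial(n, p) variable."""
    return sum(k * comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1))


if __name__ == "__main__":
    # 60 rolls of a fair die: on average n * p = 10 times the number 6.
    print(binomial_mean(60, 1 / 6))
```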


Example 5.1.15. One kilogram of a radioactive material consists of N atoms. The atoms
decay independently of each other and, moreover, the lifetime of each of the atoms is
exponentially distributed with some parameter $\lambda > 0$. We ask for the time $T_0 > 0$ at
which, on average, half of the atoms have decayed. $T_0$ is usually called the radioactive
half-life.

Answer: If T > 0, then the probability that a single atom decays before time T is
given by

$$
p(T) = E_\lambda([0, T]) = 1 - e^{-\lambda T}\,.
$$

Since the atoms decay independently, the number of atoms decaying before time T
is $B_{N,p(T)}$-distributed. Therefore, by Proposition 5.1.13, the expected number of decayed
atoms equals $N \cdot p(T) = N(1 - e^{-\lambda T})$. Hence, $T_0$ has to satisfy

$$
N(1 - e^{-\lambda T_0}) = \frac{N}{2}\,,
$$

leading to $T_0 = \ln 2/\lambda$. Conversely, if we know $T_0$ and want to determine $\lambda$, then
$\lambda = \ln 2/T_0$. Consequently, the probability that a single atom decays before time T > 0 can
also be described by

$$
E_\lambda([0, T]) = 1 - e^{-T \ln 2/T_0} = 1 - 2^{-T/T_0}\,.
$$
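The two formulas of Example 5.1.15 translate directly into code. The sketch below uses hypothetical function names (`half_life`, `decay_probability`) and an arbitrary decay rate for illustration.

```python
import math


def half_life(lam):
    """Half-life T_0 = ln 2 / lambda for an exponential lifetime with rate lam."""
    return math.log(2.0) / lam


def decay_probability(t, t_half):
    """Probability 1 - 2**(-t / T_0) that a single atom decays before time t."""
    return 1.0 - 2.0 ** (-t / t_half)


if __name__ == "__main__":
    lam = 0.1  # assumed decay rate, chosen only for the demonstration
    t0 = half_life(lam)
    print(t0, decay_probability(t0, t0), decay_probability(2 * t0, t0))
```

After one half-life the decay probability is 1/2, after two half-lives it is 3/4.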


Next, we determine the expected value of Poisson distributed random variables. 


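Proposition 5.1.16. For some $\lambda > 0$, let X be distributed according to $\mathrm{Pois}_\lambda$. Then it
follows that $\mathbb{E}X = \lambda$.

Proof: The possible values of X are 0, 1, .... Hence, the expected value is given by

$$
\mathbb{E}X = \sum_{k=0}^{\infty} k \cdot P\{X = k\} = \sum_{k=1}^{\infty} k\, \frac{\lambda^k}{k!}\, e^{-\lambda}
= \lambda \sum_{k=1}^{\infty} \frac{\lambda^{k-1}}{(k-1)!}\, e^{-\lambda}\,,
$$

which transforms by a shift of the index to

$$
\mathbb{E}X = \lambda \left[ \sum_{k=0}^{\infty} \frac{\lambda^k}{k!} \right] e^{-\lambda} = \lambda\, e^{\lambda}\, e^{-\lambda} = \lambda\,.
$$

This proves the assertion. □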


Interpretation: Proposition 5.1.16 explains the role of the parameter $\lambda$ in the definition
of the Poisson distribution. Whenever certain numbers are Poisson distributed, then
$\lambda > 0$ is the average of the observed values. For example, if the number of accidents per
week is known to be $\mathrm{Pois}_\lambda$-distributed, then the parameter $\lambda$ is determined by the average
number of accidents per week in the past. Or, as we already mentioned in Example
4.6.6, the number of raisins in a piece of p pounds of dough is $\mathrm{Pois}_{\lambda p}$-distributed, where
$\lambda$ is the proportion of raisins per pound dough, hence $\lambda p$ is the average number of
raisins per p pounds.


Example 5.1.17. Let us once more take a look at Example 4.6.15. There we considered
light bulbs with $E_\lambda$-distributed lifetime. Every time a bulb burned out, we replaced it
by a new one of the same type. It turned out that the number of necessary replacements
until time T > 0 was $\mathrm{Pois}_{\lambda T}$-distributed. Consequently, by Proposition 5.1.16, on
average, until time T we have to change the light bulbs $\lambda T$ times.


Finally, we compute the expected value of a negative binomial distributed random
variable. According to Definition 1.4.41, a random variable X is $B^-_{n,p}$-distributed if

$$
P\{X = k\} = \binom{k-1}{n-1} p^n (1-p)^{k-n}\,, \quad k = n, n+1, \ldots
$$

or, equivalently, if

$$
P\{X = k+n\} = \binom{-n}{k} p^n (p-1)^k\,, \quad k = 0, 1, \ldots \tag{5.8}
$$


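Proposition 5.1.18. Suppose X is $B^-_{n,p}$-distributed for some $n \ge 1$ and $p \in (0, 1)$. Then

$$
\mathbb{E}X = \frac{n}{p}\,.
$$

Proof: Using eq. (5.8), the expected value of X is computed as

$$
\begin{aligned}
\mathbb{E}X &= \sum_{k=n}^{\infty} k\, P\{X = k\} = \sum_{k=0}^{\infty} (k+n)\, P\{X = k+n\} \\
&= p^n \sum_{k=0}^{\infty} k \binom{-n}{k} (p-1)^k + n\, p^n \sum_{k=0}^{\infty} \binom{-n}{k} (p-1)^k\,.
\end{aligned} \tag{5.9}
$$

To evaluate the two sums in eq. (5.9) we use Proposition A.5.2, which asserts

$$
\frac{1}{(1+x)^n} = \sum_{k=0}^{\infty} \binom{-n}{k} x^k \tag{5.10}
$$

for |x| < 1. Applying this with x = p - 1 (recall 0 < p < 1),

$$
n\, p^n \sum_{k=0}^{\infty} \binom{-n}{k} (p-1)^k = n\, p^n\, \frac{1}{(1 + (p-1))^n} = n\,. \tag{5.11}
$$

Next we differentiate eq. (5.10) with respect to x and obtain

$$
\frac{-n}{(1+x)^{n+1}} = \sum_{k=1}^{\infty} k \binom{-n}{k} x^{k-1}\,,
$$

which, multiplying both sides by x, gives

$$
\frac{-nx}{(1+x)^{n+1}} = \sum_{k=1}^{\infty} k \binom{-n}{k} x^k\,. \tag{5.12}
$$

Letting x = p - 1 in eq. (5.12), the first sum in eq. (5.9) becomes

$$
p^n \sum_{k=0}^{\infty} k \binom{-n}{k} (p-1)^k = p^n\, \frac{-n(p-1)}{(1 + (p-1))^{n+1}} = \frac{n(1-p)}{p}\,. \tag{5.13}
$$

Finally, we combine eqs. (5.9), (5.11), and (5.13) and obtain

$$
\mathbb{E}X = \frac{n(1-p)}{p} + n = \frac{n}{p}
$$

as claimed. □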

Remark 5.1.19. Proposition 5.1.18 asserts that on average the nth success occurs in
trial n/p. For example, when rolling a die, on average the first appearance of the number “6”
will be in trial 6, the second in trial 12, and so on.

Corollary 5.1.20. If X is geometric distributed with parameter p, then

$$
\mathbb{E}X = \frac{1}{p}\,. \tag{5.14}
$$

Proof: Recall that $G_p = B^-_{1,p}$; hence X is $B^-_{1,p}$-distributed, and $\mathbb{E}X = 1/p$ by Proposition
5.1.18. □

Alternative proof of Corollary 5.1.20: Suppose X is $G_p$-distributed. Then we write

$$
\begin{aligned}
\mathbb{E}X &= p \sum_{k=1}^{\infty} k (1-p)^{k-1} = p \sum_{k=0}^{\infty} (k+1)(1-p)^k \\
&= (1-p) \sum_{k=0}^{\infty} k\, p\, (1-p)^{k-1} + p \sum_{k=0}^{\infty} (1-p)^k \\
&= (1-p)\, \mathbb{E}X + 1\,.
\end{aligned}
$$

Solving this equation with respect to EX proves eq. (5.14) as asserted. Observe that
this alternative proof is based upon the knowledge of $\mathbb{E}X < \infty$. Otherwise, we could
not solve the equation with respect to EX. But, because of 0 < p < 1, this fact is an
easy consequence of

$$
\mathbb{E}X = p \sum_{k=1}^{\infty} k (1-p)^{k-1} < \infty\,.
$$
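Proposition 5.1.18 and Corollary 5.1.20 can be verified numerically by summing $k \cdot P\{X = k\}$ for the negative binomial distribution. The sketch below is an illustration under the assumption that truncating the series after `terms` summands is accurate enough; the function name is hypothetical. The probabilities are generated with the ratio $P\{X = k+1\}/P\{X = k\} = k(1-p)/(k-n+1)$, which follows from the first formula for $P\{X = k\}$ before eq. (5.8).

```python
def negative_binomial_mean(n, p, terms=5000):
    """Truncated sum of k * P{X = k}, k = n, n+1, ..., for the trial of the nth success."""
    prob = p**n  # P{X = n}: the first n trials are all successes
    total = 0.0
    for k in range(n, n + terms):
        total += k * prob
        # Step from P{X = k} to P{X = k + 1}.
        prob *= k * (1.0 - p) / (k - n + 1)
    return total


if __name__ == "__main__":
    # Rolling a fair die: first "6" on average in trial 6, second in trial 12.
    print(negative_binomial_mean(1, 1 / 6), negative_binomial_mean(2, 1 / 6))
```

The case n = 1 is the geometric distribution of Corollary 5.1.20.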


5.1.3 Expected Value of Continuous Random Variables 


Let X be a continuous random variable with distribution density p, that is, if $t \in \mathbb{R}$,
then

$$
P\{X \le t\} = \int_{-\infty}^{t} p(x)\,dx\,.
$$

How to define EX in this case?

To answer this question, let us present formula (5.3) in an equivalent way. Suppose
X maps $\Omega$ into a set $D \subset \mathbb{R}$, which is either finite or countably infinite. Let $p : \mathbb{R} \to [0,1]$
be the probability mass function of X introduced in eq. (3.4). Then the expected value
of X may also be written as

$$
\mathbb{E}X = \sum_{x \in \mathbb{R}} x\, p(x)\,.
$$

In this form, the preceding formula suggests that in the continuous case the sum
should be replaced by an integral. This can be made more precise by approximating
continuous random variables by discrete ones. But this is only a heuristic explanation;
for a precise approach, deeper convergence theorems for random variables
are needed. Therefore, we do not give more details here; we simply replace sums by
integrals.

Doing so, for continuous random variables the following approach for the definition
of EX might be taken. If $p : \mathbb{R} \to [0, \infty)$ is the distribution density of X, set

$$
\mathbb{E}X := \int_{-\infty}^{\infty} x\, p(x)\,dx\,. \tag{5.15}
$$

However, here we have a similar problem as in the discrete case, namely that the integral
in eq. (5.15) need not exist. Therefore, let us give a short digression about the
integrability of real functions.

Let $f : \mathbb{R} \to \mathbb{R}$ be a function such that for all a < b the integral $\int_a^b f(x)\,dx$ is a
well-defined real number. Then

$$
\int_{-\infty}^{\infty} f(x)\,dx := \lim_{\substack{a \to -\infty \\ b \to \infty}} \int_a^b f(x)\,dx\,, \tag{5.16}
$$

provided both limits on the right-hand side of eq. (5.16) exist. In this case we call f
integrable (in the Riemann sense) on $\mathbb{R}$.

If $f(x) \ge 0$, $x \in \mathbb{R}$, then the limit $\lim_{a \to -\infty,\, b \to \infty} \int_a^b f(x)\,dx$ always exists in a generalized
sense, that is, it may be finite (then f is integrable) or infinite; the latter is expressed
by $\int_{-\infty}^{\infty} f(x)\,dx = \infty$.

If $\int_{-\infty}^{\infty} |f(x)|\,dx < \infty$, then f is said to be absolutely integrable, and as in the case
of infinite series, absolutely integrable functions are integrable. Note that $x \mapsto \sin x / x$
is integrable, but not absolutely integrable.

After this preparation, we come back to the definition of the expected value for 
continuous random variables. 


Definition 5.1.21. Let X be a random variable with distribution density p. If
p(x) = 0 for x < 0, or, equivalently, $P\{X \ge 0\} = 1$, then the expected value of
X is defined by

$$
\mathbb{E}X := \int_0^{\infty} x\, p(x)\,dx\,. \tag{5.17}
$$

Observe that under these conditions upon p or X, we have $x\, p(x) \ge 0$. Therefore, the
integral in eq. (5.17) is always well-defined, but might be infinite. In this case we write
$\mathbb{E}X = \infty$.

Let us turn now to the case of $\mathbb{R}$-valued random variables. The following example
shows that the integral in eq. (5.15) may not exist; hence, in general, without an
additional assumption the expected value cannot be defined by eq. (5.15).


Example 5.1.22. A random variable X is supposed to possess the density (check that
this is indeed a density function)

$$
p(x) = \begin{cases} 0 & : \ -1 < x < 1 \\[2pt] \dfrac{1}{2x^2} & : \ |x| \ge 1 \end{cases}
$$

If we try to evaluate EX by virtue of eq. (5.15), then, because of

$$
\begin{aligned}
\int_{-\infty}^{\infty} x\, p(x)\,dx &= \lim_{\substack{a \to -\infty \\ b \to \infty}} \int_a^b x\, p(x)\,dx
= \frac{1}{2} \left[ \lim_{b \to \infty} \int_1^b \frac{dx}{x} + \lim_{a \to -\infty} \int_a^{-1} \frac{dx}{x} \right] \\
&= \frac{1}{2} \left[ \lim_{b \to \infty} \int_1^b \frac{dx}{x} - \lim_{a \to \infty} \int_1^a \frac{dx}{x} \right] = \infty - \infty\,,
\end{aligned}
$$

we observe an undetermined expression. Thus, there is no meaningful way to introduce
an expected value for X.


We enforce the existence of the integral by the following condition.

Definition 5.1.23. Let X be a (real-valued) random variable with distribution density p.
We say the expected value of X exists, provided p satisfies the following
integrability condition²:

$$
\mathbb{E}|X| := \int_{-\infty}^{\infty} |x|\, p(x)\,dx < \infty\,. \tag{5.18}
$$


Condition (5.18) says nothing but that f(x) := x p(x) is absolutely integrable. Hence, 
as mentioned above, f is integrable, and the integral in the following definition is 
well-defined. 


Definition 5.1.24. Suppose condition (5.18) is satisfied. Then the expected value
of X is defined by

$$
\mathbb{E}X := \int_{-\infty}^{\infty} x\, p(x)\,dx\,.
$$


2 At this point it is not clear that the right-hand integral is indeed the expected value of |X|. This will
follow later on by Proposition 5.1.36. Nevertheless, we use this notation before giving a proof.

5.1.4 Expected Value of Certain Continuous Random Variables


We start with computing the expected value of a uniformly distributed (continuous) 
random variable. 


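Proposition 5.1.25. Let X be uniformly distributed on the finite interval $I = [\alpha, \beta]$. Then

$$
\mathbb{E}X = \frac{\alpha + \beta}{2}\,,
$$

that is, the expected value is the midpoint of the interval I.


Proof: The distribution density of X is the function p defined as $p(x) = (\beta - \alpha)^{-1}$ if
$x \in I$, and p(x) = 0 if $x \notin I$. Of course, X possesses an expected value,³ which can be
evaluated by

$$
\mathbb{E}X = \int_{-\infty}^{\infty} x\, p(x)\,dx = \int_{\alpha}^{\beta} \frac{x}{\beta - \alpha}\,dx
= \frac{1}{\beta - \alpha} \left[ \frac{x^2}{2} \right]_{\alpha}^{\beta}
= \frac{1}{2} \cdot \frac{\beta^2 - \alpha^2}{\beta - \alpha} = \frac{\alpha + \beta}{2}\,.
$$

This proves the proposition. □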


Next we determine the expected value of a gamma distributed random variable. 


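Proposition 5.1.26. Suppose X is $\Gamma_{\alpha,\beta}$-distributed with $\alpha, \beta > 0$. Then its expected
value is

$$
\mathbb{E}X = \alpha\beta\,.
$$

Proof: Because of $P\{X > 0\} = 1$, its expected value is well-defined and computed by

$$
\mathbb{E}X = \int_0^{\infty} x\, p(x)\,dx = \frac{1}{\alpha^{\beta}\, \Gamma(\beta)} \int_0^{\infty} x \cdot x^{\beta - 1} e^{-x/\alpha}\,dx
= \frac{1}{\alpha^{\beta}\, \Gamma(\beta)} \int_0^{\infty} x^{\beta} e^{-x/\alpha}\,dx\,. \tag{5.19}
$$

The change of variables $u := x/\alpha$ transforms eq. (5.19) into

$$
\mathbb{E}X = \frac{\alpha^{\beta+1}}{\alpha^{\beta}\, \Gamma(\beta)} \int_0^{\infty} u^{\beta} e^{-u}\,du
= \frac{\alpha^{\beta+1}}{\alpha^{\beta}\, \Gamma(\beta)} \cdot \Gamma(\beta + 1) = \alpha\beta\,,
$$

where we used eq. (1.48) in the last step. This completes the proof. □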


3 $|x|\,p(x)$ is bounded and nonzero only on a finite interval.

Corollary 5.1.27. Let X be $E_\lambda$-distributed for a certain $\lambda > 0$. Then

$$
\mathbb{E}X = \frac{1}{\lambda}\,.
$$

Proof: Note that $E_\lambda = \Gamma_{1/\lambda,\,1}$. □

Example 5.1.28. The lifetime of a special type of light bulb is exponentially distributed.
Suppose the average lifetime is 100 units of time. This implies
$\lambda = 1/100$; hence, if X describes the lifetime, then

$$
P\{X \le t\} = 1 - e^{-t/100}\,, \quad t \ge 0\,.
$$

For example, the probability that the light bulb burns longer than 200 time units
equals

$$
P\{X > 200\} = e^{-200/100} = e^{-2} = 0.135335\cdots
$$


Remark 5.1.29. If we evaluate in the previous example

$$
P\{X \ge \mathbb{E}X\} = P\{X \ge 100\} = e^{-1}\,,
$$

then we see that in general $P\{X \ge \mathbb{E}X\} \ne 1/2$. Thus, in this case, the expected value is
different from the median of X, defined as a real number M satisfying $P\{X \ge M\} \ge 1/2$
and $P\{X \le M\} \ge 1/2$. In particular, if $F_X$ satisfies the condition of Proposition 4.4.6,
then the median is uniquely determined by $M = F_X^{-1}(1/2)$, i.e., by $P\{X \le M\} = 1/2$. It
is easy to see that the above phenomenon appears for all exponentially distributed
random variables. Indeed, if X is $E_\lambda$-distributed, then $M = \ln 2/\lambda$ while, as we saw,
$\mathbb{E}X = 1/\lambda$.
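The numbers in Example 5.1.28 and Remark 5.1.29 are easy to reproduce. The following sketch uses hypothetical helper names and the rate $\lambda = 1/100$ from the example.

```python
import math


def exp_survival(t, lam):
    """P{X > t} = exp(-lam * t) for an exponentially distributed lifetime."""
    return math.exp(-lam * t)


def exp_median(lam):
    """Median M = ln 2 / lam of the exponential distribution."""
    return math.log(2.0) / lam


if __name__ == "__main__":
    lam = 1.0 / 100.0
    print(exp_survival(200.0, lam))      # probability of burning longer than 200 units
    print(exp_survival(1.0 / lam, lam))  # P{X >= EX} = exp(-1), not 1/2
    print(exp_median(lam), 1.0 / lam)    # median versus expected value
```

The median is smaller than the expected value by the factor ln 2.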


Corollary 5.1.30. If X is $\chi^2_n$-distributed, then

$$
\mathbb{E}X = n\,.
$$

Proof: Since $\chi^2_n = \Gamma_{2,\,n/2}$, by Proposition 5.1.26 it follows that $\mathbb{E}X = 2 \cdot n/2 = n$. □


Which expected value does a beta distributed random variable possess? The next 
proposition answers this question. 


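Proposition 5.1.31. Let X be $\mathcal{B}_{\alpha,\beta}$-distributed for certain $\alpha, \beta > 0$. Then

$$
\mathbb{E}X = \frac{\alpha}{\alpha + \beta}\,.
$$

Proof: Using eq. (1.60), by eq. (5.17) we obtain, as asserted,

$$
\begin{aligned}
\mathbb{E}X &= \frac{1}{B(\alpha, \beta)} \int_0^1 x \cdot x^{\alpha - 1} (1-x)^{\beta - 1}\,dx \\
&= \frac{1}{B(\alpha, \beta)} \int_0^1 x^{\alpha} (1-x)^{\beta - 1}\,dx = \frac{B(\alpha + 1, \beta)}{B(\alpha, \beta)} = \frac{\alpha}{\alpha + \beta}\,. \qquad \square
\end{aligned}
$$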


Example 5.1.32. Suppose we choose independently n numbers $x_1, \ldots, x_n$ uniformly
distributed on [0, 1] and order them by their size. Then we get the order statistics
$0 \le x_1^* \le \cdots \le x_n^* \le 1$. According to Example 3.7.8, if $1 \le k \le n$, then the number $x_k^*$ is
$\mathcal{B}_{k,\,n-k+1}$-distributed. Thus Proposition 5.1.31 implies that the average value of $x_k^*$, that
is, of the kth smallest number, equals

$$
\frac{k}{k + (n - k + 1)} = \frac{k}{n+1}\,.
$$

In particular, the expected value of the smallest number is $\frac{1}{n+1}$, while that of the largest
one is $\frac{n}{n+1}$.
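The value k/(n+1) from Example 5.1.32 can be illustrated by a small simulation. This is a Monte Carlo sketch with an arbitrarily chosen seed and number of trials, so it only approximates the expected value; the function name is hypothetical.

```python
import random


def mean_order_statistic(n, k, trials=20000, seed=0):
    """Estimate the mean of the kth smallest of n independent uniform numbers on [0, 1]."""
    rng = random.Random(seed)  # fixed seed, only to make the run reproducible
    total = 0.0
    for _ in range(trials):
        sample = sorted(rng.random() for _ in range(n))
        total += sample[k - 1]
    return total / trials


if __name__ == "__main__":
    n = 5
    for k in range(1, n + 1):
        print(k, mean_order_statistic(n, k), k / (n + 1))
```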

Does a Cauchy distributed random variable possess an expected value? Here we obtain 
the following. 


Proposition 5.1.33. If X is Cauchy distributed, then EX does not exist.


Proof: First observe that we may not use Definition 5.1.21. The distribution density of X
is given by $p(x) = \frac{1}{\pi} \cdot \frac{1}{1 + x^2}$; hence, it does not satisfy p(x) = 0 for x < 0. Consequently,
we have to check whether condition (5.18) is satisfied. Here we get

$$
\mathbb{E}|X| = \frac{1}{\pi} \int_{-\infty}^{\infty} \frac{|x|}{1 + x^2}\,dx = \frac{2}{\pi} \int_0^{\infty} \frac{x}{1 + x^2}\,dx
= \frac{1}{\pi} \left[ \ln(1 + x^2) \right]_0^{\infty} = \infty\,.
$$

Thus, $\mathbb{E}|X| = \infty$, that is, X does not possess an expected value. □
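The divergence in the proof of Proposition 5.1.33 is slow, which a few numbers make visible. The sketch below evaluates the integral of $|x|\,p(x)$ over the truncated range [-b, b], using the antiderivative from the proof; the function name is an illustrative choice.

```python
import math


def truncated_abs_moment(b):
    """(1/pi) * integral of |x| / (1 + x^2) over [-b, b], i.e. ln(1 + b^2) / pi."""
    return math.log1p(b * b) / math.pi


if __name__ == "__main__":
    for b in (10.0, 1e3, 1e6, 1e12):
        print(b, truncated_abs_moment(b))
```

The values grow like $(2/\pi) \ln b$ and therefore exceed every bound as b increases.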


Finally, we determine the expected value of normally distributed random variables. 


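Proposition 5.1.34. If X is $\mathcal{N}(\mu, \sigma^2)$-distributed, then

$$
\mathbb{E}X = \mu\,.
$$

Proof: First, we check whether the expected value exists. The density of X is given by
eq. (1.47), hence

$$
\begin{aligned}
\int_{-\infty}^{\infty} |x|\, p_{\mu,\sigma}(x)\,dx &= \frac{1}{\sqrt{2\pi}\,\sigma} \int_{-\infty}^{\infty} |x|\, e^{-(x-\mu)^2/2\sigma^2}\,dx \\
&= \frac{1}{\sqrt{\pi}} \int_{-\infty}^{\infty} |\sqrt{2}\,\sigma u + \mu|\, e^{-u^2}\,du \\
&\le \sigma\, \frac{2\sqrt{2}}{\sqrt{\pi}} \int_0^{\infty} u\, e^{-u^2}\,du + |\mu|\, \frac{2}{\sqrt{\pi}} \int_0^{\infty} e^{-u^2}\,du < \infty\,,
\end{aligned}
$$

where we used the well-known fact⁴ that for all $k \in \mathbb{N}_0$

$$
\int_0^{\infty} u^k e^{-u^2}\,du < \infty\,.
$$

The expected value EX is now evaluated in a similar way by

$$
\begin{aligned}
\mathbb{E}X &= \int_{-\infty}^{\infty} x\, p_{\mu,\sigma}(x)\,dx = \frac{1}{\sqrt{2\pi}\,\sigma} \int_{-\infty}^{\infty} x\, e^{-(x-\mu)^2/2\sigma^2}\,dx \\
&= \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} (\sigma v + \mu)\, e^{-v^2/2}\,dv \\
&= \sigma\, \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} v\, e^{-v^2/2}\,dv + \mu\, \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-v^2/2}\,dv\,.
\end{aligned} \tag{5.20}
$$

The function $f(v) := v\, e^{-v^2/2}$ is odd, that is, f(-v) = -f(v), thus $\int_{-\infty}^{\infty} f(v)\,dv = 0$, and the
first integral in eq. (5.20) vanishes. To compute the second integral use Proposition
1.6.6 and obtain

$$
\mu\, \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\infty} e^{-v^2/2}\,dv = \mu\, \frac{1}{\sqrt{2\pi}}\, \sqrt{2\pi} = \mu\,.
$$

This completes the proof. □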


Remark 5.1.35. Proposition 5.1.34 justifies the notation “expected value” for the parameter
$\mu$ in the definition of the probability measure $\mathcal{N}(\mu, \sigma^2)$.


4 See either [Spi08] or use that for all $k \ge 1$ one has $\sup_{u > 0} u^k e^{-u} < \infty$.

5.1.5 Properties of the Expected Value


In this section we summarize the main properties of the expected value. They are
valid for both discrete and continuous random variables. But, unfortunately, within
the framework of this book it is not possible to prove most of them in full generality.
To do so one needs an integral (Lebesgue integral) $\int_\Omega f\,dP$ of functions $f : \Omega \to \mathbb{R}$ for
some probability space $(\Omega, \mathcal{A}, P)$. Then $\mathbb{E}X = \int_\Omega X\,dP$, and all subsequent properties of
EX follow from those of the (Lebesgue) integral.


Proposition 5.1.36. The expected value of random variables has the following
properties:

(1) The expected value of X only depends on its probability distribution $P_X$, not on the
way X is defined. That is, if $X \stackrel{d}{=} Y$ for two random variables X and Y, then
$\mathbb{E}X = \mathbb{E}Y$.

(2) If X is constant with probability 1, that is, there is some $c \in \mathbb{R}$ with P(X = c) = 1,
then $\mathbb{E}X = c$.

(3) The expected value is linear: let X and Y be two random variables possessing an
expected value and let $a, b \in \mathbb{R}$. Then $\mathbb{E}(aX + bY)$ exists as well and, moreover,

$$
\mathbb{E}(aX + bY) = a\,\mathbb{E}X + b\,\mathbb{E}Y\,.
$$

(4) Suppose X is a discrete random variable with values in $x_1, x_2, \ldots$. Given a function
f from $\mathbb{R}$ to $\mathbb{R}$, the expected value $\mathbb{E}f(X)$ exists if and only if

$$
\sum_{i=1}^{\infty} |f(x_i)|\, P\{X = x_i\} < \infty\,,
$$

and, moreover, then

$$
\mathbb{E}f(X) = \sum_{i=1}^{\infty} f(x_i)\, P\{X = x_i\}\,. \tag{5.21}
$$

(5) If X is continuous with density p, then for any measurable function $f : \mathbb{R} \to \mathbb{R}$ the
expected value $\mathbb{E}f(X)$ exists if and only if

$$
\int_{-\infty}^{\infty} |f(x)|\, p(x)\,dx < \infty\,.
$$

In this case it follows that

$$
\mathbb{E}f(X) = \int_{-\infty}^{\infty} f(x)\, p(x)\,dx\,. \tag{5.22}
$$

(6) For independent X and Y possessing an expected value, the expected value of $X \cdot Y$
exists as well and, moreover,

$$
\mathbb{E}[X\,Y] = \mathbb{E}X \cdot \mathbb{E}Y\,.
$$

(7) Write $X \le Y$ provided that $X(\omega) \le Y(\omega)$ for all $\omega \in \Omega$. If in this sense $|X| \le Y$ for
some Y with $\mathbb{E}Y < \infty$, then $\mathbb{E}|X| < \infty$ and, hence, EX exists.

(8) Suppose EX and EY exist. Then $X \le Y$ implies $\mathbb{E}X \le \mathbb{E}Y$. In particular, if $X \ge 0$,
then $\mathbb{E}X \ge 0$.


Proof: We only prove properties (1), (2), (4), and (8). Some of the other properties are 
not difficult to verify in the case of discrete random variables, for example, (3), but 
because the proofs are incomplete, we do not present them here. We refer to [Bil12], 
[Dur10] or [Kho07] for the proofs of the remaining properties. 

We begin with the proof of (1). If X and Y are identically distributed, then either
both are discrete or both are continuous. If they are discrete, and $P_X(D) = 1$ for an at
most countably infinite set D, then $X \stackrel{d}{=} Y$ implies $P_Y(D) = 1$. Moreover, by the same
argument $P_X(\{x\}) = P_Y(\{x\})$ for any $x \in D$. Hence, in view of Definition 5.1.2, EX exists
if and only if EY does so. Moreover, if this is valid, then EX = EY by the same
argument.

In the continuous case we argue as follows. Let p be a density of X. By $X \stackrel{d}{=} Y$ it
follows that

$$
\int_{-\infty}^{t} p(x)\,dx = P_X((-\infty, t]) = P_Y((-\infty, t])\,, \quad t \in \mathbb{R}\,.
$$

Thus, p is also a distribution density of Y and, consequently, in view of Definition
5.1.23, the expected value of X exists if and only if this is the case for Y. Moreover, by
Definition 5.1.24 we get EX = EY.

Next we show that (2) is valid. Thus, suppose P{X = c} = 1 for some $c \in \mathbb{R}$. Then X
is discrete with $P_X(D) = 1$ where D = {c}, and by Definition 5.1.2 we obtain

$$
\mathbb{E}X = c \cdot P\{X = c\} = c \cdot 1 = c
$$

as asserted.

To prove (4) we assume that X has values in $D = \{x_1, x_2, \ldots\}$. Then Y = f(X) maps
into $f(D) = \{y_1, y_2, \ldots\}$. Given $j \in \mathbb{N}$ let $D_j = \{x_i : f(x_i) = y_j\}$. Thus,

$$
P\{Y = y_j\} = P\{X \in D_j\} = \sum_{x_i \in D_j} P\{X = x_i\}\,.
$$

Consequently, since $D_j \cap D_{j'} = \emptyset$ if $j \ne j'$, by $\bigcup_{j=1}^{\infty} D_j = D$ we get

$$
\begin{aligned}
\mathbb{E}|Y| &= \sum_{j=1}^{\infty} |y_j|\, P\{Y = y_j\} = \sum_{j=1}^{\infty} \sum_{x_i \in D_j} |y_j|\, P\{X = x_i\} \\
&= \sum_{j=1}^{\infty} \sum_{x_i \in D_j} |f(x_i)|\, P\{X = x_i\} = \sum_{i=1}^{\infty} |f(x_i)|\, P\{X = x_i\}\,.
\end{aligned}
$$

This proves the first part of (4). The second part follows by exactly the same arguments
(replace $|y_j|$ by $y_j$). Therefore, we omit its proof.

We finally prove (8). To this end we first show the second part, that is, $\mathbb{E}X \ge 0$ for
$X \ge 0$. If X is discrete, then X attains values in D, where D consists only of non-negative
real numbers. Hence, $x_j P\{X = x_j\} \ge 0$, which implies $\mathbb{E}X \ge 0$. If X is continuous,
in view of $X \ge 0$, we may choose its density p such that p(x) = 0 if x < 0. Then
$\mathbb{E}X = \int_0^{\infty} x\, p(x)\,dx \ge 0$.

Suppose now $X \le Y$. Setting Z = Y - X, by the first step it follows that $\mathbb{E}Z \ge 0$. But property
(3) implies EZ = EY - EX, from which we derive $\mathbb{E}X \le \mathbb{E}Y$ as asserted. Note that
by assumption EX and EY are real numbers, so that EY - EX is not an undetermined
expression. □


Remark 5.1.37. Properties (4) and (5) of the previous proposition, applied with f(x) =
|x|, lead to

$$
\mathbb{E}|X| = \sum_{j=1}^{\infty} |x_j|\, P\{X = x_j\} \quad \text{or} \quad \mathbb{E}|X| = \int_{-\infty}^{\infty} |x|\, p(x)\,dx\,,
$$

as we already stated in conditions (5.4) and (5.18).


Corollary 5.1.38. If EX exists, then shifting X by $\mu = \mathbb{E}X$, it becomes centralized (the
expected value is zero).


Proof: If $Y = X - \mu$, then properties (2) and (3) of Proposition 5.1.36 imply

$$
\mathbb{E}Y = \mathbb{E}(X - \mu) = \mathbb{E}X - \mathbb{E}\mu = \mu - \mu = 0\,,
$$

as asserted. □


An important consequence of (8) in Proposition 5.1.36 reads as follows. 


Corollary 5.1.39. If EX exists, then

$$
|\mathbb{E}X| \le \mathbb{E}|X|\,.
$$

Proof: For all $\omega \in \Omega$ it follows that

$$
-|X(\omega)| \le X(\omega) \le |X(\omega)|\,,
$$

that is, we have $-|X| \le X \le |X|$. We apply now (3) and (8) of Proposition 5.1.36 and
conclude that

$$
-\mathbb{E}|X| = \mathbb{E}(-|X|) \le \mathbb{E}X \le \mathbb{E}|X|\,. \tag{5.23}
$$

Since $|a| \le c$ for $a, c \in \mathbb{R}$ is equivalent to $-c \le a \le c$, the desired estimate is a
consequence of inequalities (5.23) with $a = \mathbb{E}X$ and $c = \mathbb{E}|X|$. □


We now present some examples that show how Proposition 5.1.36 may be used to 
evaluate certain expected values. 


Example 5.1.40. Suppose we roll n fair dice. Let $S_n$ be the sum of the observed values.
What is the expected value of $S_n$?

Answer: If $X_j$ denotes the value of die j, then $X_1, \ldots, X_n$ are uniformly distributed
on {1, ... ,6} with $\mathbb{E}X_j = 7/2$ and, moreover, $S_n = X_1 + \cdots + X_n$. Thus, property (3) lets
us conclude that

$$
\mathbb{E}S_n = \mathbb{E}(X_1 + \cdots + X_n) = \mathbb{E}X_1 + \cdots + \mathbb{E}X_n = \frac{7n}{2}\,.
$$

Example 5.1.41. In Example 4.1.7 we investigated the random walk of a particle on $\mathbb{Z}$.
Each time it jumped either with probability p one step to the right or with probability
1 - p one step to the left. $S_n$ denoted the position of the particle after n steps. What is
the expected position after n steps?

Answer: We proved that $S_n = 2Y_n - n$ with a $B_{n,p}$-distributed random variable $Y_n$.
Proposition 5.1.13 implies $\mathbb{E}Y_n = np$, hence the linearity of the expected value leads to

$$
\mathbb{E}S_n = 2\,\mathbb{E}Y_n - n = 2np - n = n(2p - 1)\,.
$$

For p = 1/2 we obtain the (not very surprising) result $\mathbb{E}S_n = 0$.

Alternative approach: If $X_j$ is the size of jump j, then $P\{X_j = -1\} = 1 - p$ and
$P\{X_j = +1\} = p$. Hence, $\mathbb{E}X_j = (-1)(1 - p) + 1 \cdot p = 2p - 1$, and because of $S_n = X_1 + \cdots + X_n$ we
get $\mathbb{E}S_n = n\,\mathbb{E}X_1 = n(2p - 1)$ as before.


The next example demonstrates how property (4) of Proposition 5.1.36 may be used. 


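Example 5.1.42. Let X be $\mathrm{Pois}_\lambda$-distributed. Find $\mathbb{E}X^2$.
Solution: Property (4) of Proposition 5.1.36 implies

$$
\mathbb{E}X^2 = \sum_{k=0}^{\infty} k^2\, P\{X = k\} = \sum_{k=1}^{\infty} k^2\, \frac{\lambda^k}{k!}\, e^{-\lambda}
= \lambda \sum_{k=1}^{\infty} k\, \frac{\lambda^{k-1}}{(k-1)!}\, e^{-\lambda}\,.
$$

We shift the index of summation in the right-hand sum by 1 and get

$$
\lambda \sum_{k=0}^{\infty} (k+1)\, \frac{\lambda^k}{k!}\, e^{-\lambda}
= \lambda \sum_{k=0}^{\infty} k\, \frac{\lambda^k}{k!}\, e^{-\lambda} + \lambda \sum_{k=0}^{\infty} \frac{\lambda^k}{k!}\, e^{-\lambda}\,.
$$

By Proposition 5.1.16, the first sum coincides with $\lambda\,\mathbb{E}X = \lambda^2$, while the second one gives
$\lambda\, \mathrm{Pois}_\lambda(\mathbb{N}_0) = \lambda \cdot 1 = \lambda$. Adding both values leads to

$$
\mathbb{E}X^2 = \lambda^2 + \lambda\,.
$$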
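The result of Example 5.1.42 (and, along the way, Proposition 5.1.16) can be checked by summing property (4) of Proposition 5.1.36 numerically. The sketch below assumes that a truncation after `terms` summands is sufficient for moderate $\lambda$; the function name is hypothetical.

```python
import math


def poisson_moment(lam, power, terms=200):
    """Truncated sum of k**power * P{X = k} for a Poisson(lam) variable."""
    prob = math.exp(-lam)  # P{X = 0}
    total = 0.0
    for k in range(terms):
        total += k**power * prob
        prob *= lam / (k + 1)  # P{X = k + 1} = P{X = k} * lam / (k + 1)
    return total


if __name__ == "__main__":
    lam = 3.5  # arbitrary parameter for the demonstration
    print(poisson_moment(lam, 1), lam)
    print(poisson_moment(lam, 2), lam**2 + lam)
```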


The next example rests upon an application of properties (3), (4) and (6) in Proposi- 
tion 5.1.36. 


Example 5.1.43. Compute $\mathbb{E}X^2$ for X being $B_{n,p}$-distributed.

Solution: Let $X_1, \ldots, X_n$ be independent $B_{1,p}$-distributed random variables. Then
Corollary 4.6.2 asserts that $X = X_1 + \cdots + X_n$ is $B_{n,p}$-distributed. Therefore, it suffices
to evaluate $\mathbb{E}X^2$ with $X = X_1 + \cdots + X_n$. Thus, property (3) of Proposition 5.1.36 implies

$$
\mathbb{E}X^2 = \mathbb{E}(X_1 + \cdots + X_n)^2 = \sum_{i=1}^{n} \sum_{j=1}^{n} \mathbb{E}X_i X_j\,.
$$

If $i \ne j$, then $X_i$ and $X_j$ are independent, hence property (6) applies and yields

$$
\mathbb{E}X_i X_j = \mathbb{E}X_i \cdot \mathbb{E}X_j = p \cdot p = p^2\,.
$$

For i = j, property (4) gives

$$
\mathbb{E}X_j^2 = 0^2 \cdot P\{X_j = 0\} + 1^2 \cdot P\{X_j = 1\} = p\,.
$$

Combining both cases leads to

$$
\mathbb{E}X^2 = \sum_{i \ne j} \mathbb{E}X_i \cdot \mathbb{E}X_j + \sum_{j=1}^{n} \mathbb{E}X_j^2 = n(n-1)\,p^2 + np = n^2 p^2 + np(1-p)\,.
$$


Example 5.1.44. Let X be $G_p$-distributed. Compute $\mathbb{E}X^2$.
Solution: We claim that

$$
\mathbb{E}X^2 = \frac{2 - p}{p^2}\,. \tag{5.24}
$$

To prove this, let us start with

$$
\mathbb{E}X^2 = \sum_{k=1}^{\infty} k^2\, P\{X = k\} = p \sum_{k=1}^{\infty} k^2 (1-p)^{k-1}\,. \tag{5.25}
$$

We evaluate the right-hand sum by the following approach. If |x| < 1, then

$$
\frac{1}{1-x} = \sum_{k=0}^{\infty} x^k\,.
$$

Differentiating both sides of this equation leads to

$$
\frac{1}{(1-x)^2} = \sum_{k=1}^{\infty} k\, x^{k-1}\,.
$$

Next we multiply this equation by x and arrive at

$$
\frac{x}{(1-x)^2} = \sum_{k=1}^{\infty} k\, x^k\,.
$$

Differentiating both functions once more on $\{x \in \mathbb{R} : |x| < 1\}$ implies

$$
\frac{1}{(1-x)^2} + \frac{2x}{(1-x)^3} = \sum_{k=1}^{\infty} k^2 x^{k-1}\,.
$$

If we use the last equation with x = 1 - p, then by eq. (5.25)

$$
\mathbb{E}X^2 = p \left[ \frac{1}{p^2} + \frac{2(1-p)}{p^3} \right] = \frac{2 - p}{p^2}\,,
$$

as we claimed in eq. (5.24).
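Eq. (5.24) can be confirmed by summing the series in eq. (5.25) directly. The following is a sketch with a hypothetical function name; the truncation after `terms` summands is an assumption that is harmless as long as p is not extremely small.

```python
def geometric_second_moment(p, terms=5000):
    """Truncated sum of k**2 * p * (1 - p)**(k - 1) over k = 1, 2, ..."""
    prob = p  # P{X = 1}
    total = 0.0
    for k in range(1, terms + 1):
        total += k * k * prob
        prob *= 1.0 - p  # P{X = k + 1} = P{X = k} * (1 - p)
    return total


if __name__ == "__main__":
    for p in (0.25, 1 / 6, 0.5):
        print(p, geometric_second_moment(p), (2.0 - p) / p**2)
```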


In the next example we use property (5) of Proposition 5.1.36.


Example 5.1.45. Let U be uniformly distributed on [0, 1]. Which expected value does
$\sqrt{U}$ possess?
Solution: By property (5) it follows that

$$
\mathbb{E}\sqrt{U} = \int_0^1 \sqrt{x}\, p(x)\,dx = \int_0^1 \sqrt{x}\,dx = \left[ \frac{2}{3}\, x^{3/2} \right]_0^1 = \frac{2}{3}\,.
$$

Here $p = \mathbf{1}_{[0,1]}$ denotes the distribution density of U.

Another approach is as follows. Because of

$$
F_{\sqrt{U}}(t) = P\{\sqrt{U} \le t\} = P\{U \le t^2\} = t^2
$$

for $0 \le t \le 1$, the density q of $\sqrt{U}$ is given by q(x) = 2x, $0 \le x \le 1$, and q(x) = 0,
otherwise. Thus,

$$
\mathbb{E}\sqrt{U} = \int_0^1 x \cdot 2x\,dx = 2 \left[ \frac{x^3}{3} \right]_0^1 = \frac{2}{3}\,.
$$
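Property (5) of Proposition 5.1.36 turns the computation in Example 5.1.45 into an ordinary integral, which can also be approximated numerically. The sketch below uses a simple midpoint rule; the function name and the number of steps are illustrative choices.

```python
import math


def expected_sqrt_uniform(steps=100000):
    """Midpoint-rule approximation of the integral of sqrt(x) over [0, 1]."""
    h = 1.0 / steps
    return h * sum(math.sqrt((i + 0.5) * h) for i in range(steps))


if __name__ == "__main__":
    print(expected_sqrt_uniform(), 2.0 / 3.0)
```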
Let us present now an interesting example called Coupon collector’s problem. It was 
first mentioned in 1708 by A. De Moivre. We formulate it in a present-day version. 


Example 5.1.46. A company produces cornflakes. Each pack contains a picture. We
assume that there are n different pictures and that they are equally likely. That is,
when buying a pack, the probability to get a certain fixed picture is 1/n. How many
packs of cornflakes have to be bought on average before one gets all possible n
pictures?

An equivalent formulation of the problem is as follows. In an urn are n balls
numbered from 1 to n. One chooses balls out of the urn with replacement. How many
balls have to be chosen on average before one observes all n numbers?

Answer: Assume we already have k different pictures for some k = 0, 1, ... , n-1.
Let $X_k$ be the number of necessary purchases to obtain a new picture, that is, to get
one which we do not have. Since each pack contains a picture,

$$
P\{X_0 = 1\} = 1\,.
$$

If $k \ge 1$, then there are still n - k pictures that one does not possess. Hence, $X_k$ is
geometric distributed with success probability $p_k = (n - k)/n$. If $S_n = X_0 + \cdots + X_{n-1}$,
then $S_n$ is the total number of necessary purchases. By Corollary 5.1.20 we obtain

$$
\mathbb{E}X_k = \frac{1}{p_k} = \frac{n}{n - k}\,, \quad k = 1, \ldots, n-1\,.
$$

Note that $\mathbb{E}X_0 = 1$, thus the previous formula also holds in this case. Then the linearity
of the expected value implies

$$
\mathbb{E}S_n = 1 + \mathbb{E}X_1 + \cdots + \mathbb{E}X_{n-1} = 1 + \frac{1}{p_1} + \cdots + \frac{1}{p_{n-1}}
= 1 + \frac{n}{n-1} + \frac{n}{n-2} + \cdots + \frac{n}{1} = n \sum_{k=1}^{n} \frac{1}{k}\,.
$$

Consequently, on average, one needs $n \sum_{k=1}^{n} \frac{1}{k}$ purchases to obtain a complete collection
of all pictures.

For example, if n = 50, on average, we have to buy 225 packs, if n = 100, then 519,
for n = 200, on average, there are 1176 purchases necessary, if n = 300, then 1885, if
n = 400 we have to buy 2628 packs, and, finally, if n = 500 we need to buy 3397 ones.


Remark 5.1.47. As $n \to \infty$, the partial sums $\sum_{k=1}^{n} \frac{1}{k}$ of the harmonic series behave like $\ln n$. More
precisely (cf. [Lag13] or [Spi08], Problem 12, Chapter 22)

$$
\lim_{n \to \infty} \left[ \sum_{k=1}^{n} \frac{1}{k} - \ln n \right] = \gamma\,, \tag{5.26}
$$

where $\gamma$ denotes Euler’s constant, which is approximately 0.57721. Therefore, for
large n, the average number of necessary purchases is approximately $n\,[\ln n + \gamma]$. For
example, if n = 300, then the approximative value is 1884.29, leading also to 1885
necessary purchases.
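The numbers quoted in Example 5.1.46 and Remark 5.1.47 can be reproduced as follows. This is a sketch with hypothetical function names; the counts in the example correspond to rounding the exact expected value up to the next integer.

```python
import math

EULER_GAMMA = 0.5772156649015329  # Euler's constant, approximately 0.57721


def expected_purchases(n):
    """Exact expected number n * (1 + 1/2 + ... + 1/n) of purchases."""
    return n * sum(1.0 / k for k in range(1, n + 1))


def approx_purchases(n):
    """Approximation n * (ln n + gamma) for large n."""
    return n * (math.log(n) + EULER_GAMMA)


if __name__ == "__main__":
    for n in (50, 100, 200, 300, 400, 500):
        print(n, math.ceil(expected_purchases(n)), round(approx_purchases(n), 2))
```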


5.2 Variance 


5.2.1 Higher Moments of Random Variables 


Definition 5.2.1. Let $n \ge 1$ be some integer. A random variable X possesses an
nth moment, provided that $\mathbb{E}|X|^n < \infty$. We also say X has a finite absolute nth
moment. If this is so, then $\mathbb{E}X^n$ exists, and it is called the nth moment of X.


Remark 5.2.2. Because of $|X|^n = |X^n|$, the assumption $\mathbb{E}|X|^n < \infty$ implies the existence
of the nth moment $\mathbb{E}X^n$.

Note that a random variable X has a first moment if and only if the expected value
of X exists, cf. conditions (5.4) and (5.18). Moreover, then the first moment coincides
with EX.


Proposition 5.2.3. Let X be either a discrete random variable with values in $\{x_1, x_2, \ldots\}$
and with $p_j = P\{X = x_j\}$, or let X be continuous with density p. If $n \ge 1$, then

$$
\mathbb{E}|X|^n = \sum_{j=1}^{\infty} |x_j|^n \cdot p_j \quad \text{or} \quad \mathbb{E}|X|^n = \int_{-\infty}^{\infty} |x|^n\, p(x)\,dx\,. \tag{5.27}
$$

Consequently, X possesses a finite absolute nth moment if and only if either the sum or
the integral in eq. (5.27) is finite. If this is satisfied, then

$$
\mathbb{E}X^n = \sum_{j=1}^{\infty} x_j^n \cdot p_j \quad \text{or} \quad \mathbb{E}X^n = \int_{-\infty}^{\infty} x^n\, p(x)\,dx\,.
$$

Proof: Apply properties (4) and (5) in Proposition 5.1.36 with $f(x) = |x|^n$ or $f(x) = x^n$,
respectively. □


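Example 5.2.4. Let U be uniformly distributed on [0, 1]. Then

$$
\mathbb{E}|U|^n = \mathbb{E}U^n = \int_0^1 x^n\,dx = \frac{1}{n+1}\,.
$$

For the subsequent investigations, we need the following elementary lemma.

Lemma 5.2.5. If $0 < \alpha < \beta$, then for all $x \ge 0$

$$
x^{\alpha} \le x^{\beta} + 1\,.
$$

Proof: If $0 \le x \le 1$, by $x^{\beta} \ge 0$ it follows that

$$
x^{\alpha} \le 1 \le x^{\beta} + 1\,,
$$

and the inequality is valid.
If x > 1, then $\alpha < \beta$ implies $x^{\alpha} < x^{\beta}$, hence also for those x we arrive at

$$
x^{\alpha} < x^{\beta} < x^{\beta} + 1\,,
$$

which proves the lemma. □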


Proposition 5.2.6. Suppose a random variable X has a finite absolute nth moment.
Then X possesses all mth moments with m < n.


Proof: Suppose $\mathbb{E}|X|^n < \infty$ and choose an m < n. For fixed $\omega \in \Omega$ we apply Lemma
5.2.5 with $\alpha = m$, $\beta = n$ and $x = |X(\omega)|$. Doing so, we obtain

$$
|X(\omega)|^m \le |X(\omega)|^n + 1\,,
$$

and this being true for all $\omega \in \Omega$ implies $|X|^m \le |X|^n + 1$. Hence, property (7) of
Proposition 5.1.36 yields

$$
\mathbb{E}|X|^m \le \mathbb{E}(|X|^n + 1) = \mathbb{E}|X|^n + 1 < \infty\,.
$$

Consequently, as asserted, X possesses also an absolute mth moment. □


Remark 5.2.7. There exist much stronger estimates between different absolute moments
of X. For example, Hölder’s inequality asserts that for any $0 < \alpha \le \beta$

$$
\left( \mathbb{E}|X|^{\alpha} \right)^{1/\alpha} \le \left( \mathbb{E}|X|^{\beta} \right)^{1/\beta}\,.
$$


The case n = 2 and m = 1 in Proposition 5.2.6 is of special interest. Here we get the 
following useful result. 


Corollary 5.2.8. If X possesses a finite second moment, then $\mathbb{E}|X| < \infty$, that is, its
expected value exists.


Let us state another important consequence of Proposition 5.2.6.


Corollary 5.2.9. Suppose X has a finite absolute nth moment. Then for any $b \in \mathbb{R}$ we
also have $\mathbb{E}|X + b|^n < \infty$.


Proof: An application of the binomial theorem (Proposition A.3.7) implies

$$
|X + b|^n \le (|X| + |b|)^n = \sum_{k=0}^{n} \binom{n}{k} |X|^k\, |b|^{n-k}\,.
$$

Hence, using properties (3) and (7) of Proposition 5.1.36, we obtain

$$
\mathbb{E}|X + b|^n \le \sum_{k=0}^{n} \binom{n}{k} |b|^{n-k}\, \mathbb{E}|X|^k < \infty\,.
$$

Note that Proposition 5.2.6 implies $\mathbb{E}|X|^k < \infty$ for all k < n. This ends the proof. □


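Example 5.2.10. Let X be $\Gamma_{\alpha,\beta}$-distributed with parameters $\alpha, \beta > 0$. Which moments
does X possess, and how can they be computed?
Answer: In view of $X \ge 0$ it suffices to investigate $\mathbb{E}X^n$. For all $n \ge 1$ it follows that

$$
\begin{aligned}
\mathbb{E}X^n &= \frac{1}{\alpha^{\beta}\, \Gamma(\beta)} \int_0^{\infty} x^{n+\beta-1} e^{-x/\alpha}\,dx
= \frac{\alpha^{n+\beta}}{\alpha^{\beta}\, \Gamma(\beta)} \int_0^{\infty} y^{n+\beta-1} e^{-y}\,dy \\
&= \alpha^n\, \frac{\Gamma(\beta + n)}{\Gamma(\beta)} = \alpha^n\, (\beta + n - 1)(\beta + n - 2) \cdots (\beta + 1)\,\beta\,.
\end{aligned}
$$

In particular, X has moments of any order $n \ge 1$.

In the case of an $E_\lambda$-distributed random variable X we have $\alpha = 1/\lambda$ and $\beta = 1$,
hence

$$
\mathbb{E}X^n = \frac{n!}{\lambda^n}\,.
$$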
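Both expressions for the moments in Example 5.2.10 are easy to compare numerically. The sketch below uses `math.gamma` from the Python standard library; the function names and the sample parameters are illustrative assumptions.

```python
import math


def gamma_moment(alpha, beta, n):
    """nth moment alpha**n * Gamma(beta + n) / Gamma(beta) of a Gamma(alpha, beta) variable."""
    return alpha**n * math.gamma(beta + n) / math.gamma(beta)


def gamma_moment_product(alpha, beta, n):
    """Same moment written as alpha**n * beta * (beta + 1) * ... * (beta + n - 1)."""
    result = alpha**n
    for j in range(n):
        result *= beta + j
    return result


if __name__ == "__main__":
    print(gamma_moment(2.0, 1.5, 3), gamma_moment_product(2.0, 1.5, 3))
    lam = 2.0  # exponential case: alpha = 1 / lam, beta = 1
    print(gamma_moment_product(1.0 / lam, 1.0, 4), math.factorial(4) / lam**4)
```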


Example 5.2.11. Suppose a random variable X is $t_n$-distributed. Which moments does X
possess?

Answer: We already know that a $t_1$-distributed random variable does not possess
a first moment. Recall that X is $t_1$-distributed if it is Cauchy distributed. And in
Proposition 5.1.33 we proved $\mathbb{E}|X| = \infty$ for Cauchy distributed random variables.

But what can be said if $n \ge 2$?

According to Definition 4.7.6, the random variable X has the density p with

$$
p(x) = \frac{\Gamma\!\left( \frac{n+1}{2} \right)}{\sqrt{n\pi}\; \Gamma\!\left( \frac{n}{2} \right)} \left( 1 + \frac{x^2}{n} \right)^{-n/2 - 1/2}, \quad x \in \mathbb{R}\,.
$$

If $m \in \mathbb{N}$, then

$$
\mathbb{E}|X|^m = \frac{\Gamma\!\left( \frac{n+1}{2} \right)}{\sqrt{n\pi}\; \Gamma\!\left( \frac{n}{2} \right)} \int_{-\infty}^{\infty} |x|^m \left( 1 + \frac{x^2}{n} \right)^{-n/2 - 1/2} dx\,.
$$

Hence, X has an mth moment, if and only if the integral

$$
\int_{-\infty}^{\infty} |x|^m \left( 1 + \frac{x^2}{n} \right)^{-n/2 - 1/2} dx = 2 \int_0^{\infty} x^m \left( 1 + \frac{x^2}{n} \right)^{-n/2 - 1/2} dx \tag{5.28}
$$

is finite. Note that

$$
\lim_{x \to \infty} x^{n+1} \left( 1 + \frac{x^2}{n} \right)^{-n/2 - 1/2} = \lim_{x \to \infty} \left( x^{-2} + \frac{1}{n} \right)^{-n/2 - 1/2} = n^{n/2 + 1/2}\,;
$$

thus, there are constants $0 < c_1 < c_2$ (depending on n, but not on x) such that

$$
\frac{c_1}{x^{n-m+1}} \le x^m \left( 1 + \frac{x^2}{n} \right)^{-n/2 - 1/2} \le \frac{c_2}{x^{n-m+1}} \tag{5.29}
$$

for large x, that is, if $x > x_0$ for a suitable $x_0 \in \mathbb{R}$.

Recall that $\int_1^{\infty} x^{-\alpha}\,dx < \infty$ if and only if $\alpha > 1$. Having this in mind, in view of
eq. (5.28) and by the estimates in (5.29) we get $\mathbb{E}|X|^m < \infty$ if and only if n - m + 1 > 1,
that is, if and only if m < n.

Summing up, a $t_n$-distributed random variable has moments of order 1, ... , n-1,
but no moments of order greater than or equal to n.

Finally, let us investigate the moments of normally distributed random variables.


Example 5.2.12. How do we calculate EX" for an \(0, 1)-distributed random vari- 
able? 
Answer: Well-known properties of the exponential function imply 


1 2 
|X|" = —— / Ixi"e* ? dx Tz le e* 2 dx < 00 
Jn 7 
—~oo ie) 


for all n € N. Thus, a normally distributed random variable possesses moments of any 
order. These moments are evaluated by 


1 r 2 
ax" = f x"poa@ddx= — | xe*Pax. 


If nis an odd integer, then x — x” eX 2 ig an odd function, hence EX” = 0 for these n. 
Therefore, it suffices to investigate even n = 2m with m « N. Here we get 


1 2 
EX7" = 2. sz | ere 2 dx, 
ane 


which, by the change of variables y := x?/2, thus x = ./2y with dx = A y 12 dy, 
transforms into 


co 


mem = Be gm [pe ey dy = an r (m + 1 
Jim Jn 2)° 
0 


By I'(1/2) = ./m and an application of eq. (1.48) we finally obtain 
X?m “1 (m+5) - = (m aG ) 
: Jn 2) Vn 2 2 


2m 1 3 3 
=a (3) (m5) F("-3) 
= 2™T (1/2) - 1/2 -3/2 - - - (m—-1/2) 

Vit 
= (2m-1)(2m—-3) +--+ 3-1:=(2m-I)!!. 
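
The identity EX^{2m} = 2^m Γ(m + 1/2)/√π = (2m − 1)!! can be checked numerically. The following Python sketch is not part of the original text; the function names are chosen here purely for illustration, and only the standard library function math.gamma is used.

```python
import math

def even_moment_via_gamma(m: int) -> float:
    """E X^(2m) for a standard normal X, from 2^m * Gamma(m + 1/2) / sqrt(pi)."""
    return 2.0 ** m * math.gamma(m + 0.5) / math.sqrt(math.pi)

def double_factorial_odd(m: int) -> int:
    """(2m - 1)!! = (2m - 1)(2m - 3) ... 3 * 1."""
    result = 1
    for k in range(1, 2 * m, 2):  # k runs over 1, 3, ..., 2m - 1
        result *= k
    return result

# Compare both expressions for the first few even moments.
for m in range(1, 6):
    print(2 * m, even_moment_via_gamma(m), double_factorial_odd(m))
```

Up to floating-point rounding, both columns should agree; for instance, m = 2 gives the fourth moment 3, which is used again in the proof of Corollary 5.2.26.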


5.2.2 Variance of Random Variables 


Let X be a random variable with finite second moment. As we saw in Corollary 5.2.8,
then its expected value μ := EX exists. Furthermore, letting b = −μ, by Corollary 5.2.9,
we also have E|X − μ|² < ∞. After this preparation we can introduce the variance of a
random variable.


Definition 5.2.13. Let X be a random variable possessing a finite second moment.
If μ := EX, then its variance is defined as

\[ VX := E|X - \mu|^2 = E|X - EX|^2 . \]


Interpretation: The expected value μ of a random variable is its main characteristic.
It tells us around which value the observations of X have to be expected. But it does
not tell us how far away from μ these observations will be on average. Are they
concentrated around μ or are they widely dispersed? This behavior is described by the
variance. It is defined as the average quadratic distance of X to its mean value. If VX is
small, then we will observe realizations of X quite near to its mean. Otherwise, if VX
is large, then it is very likely to observe values of X far away from its expected value.

How do we evaluate the variance in concrete cases? We answer this question for 
discrete and continuous random variables separately. 


Proposition 5.2.14. Let X be a random variable with finite second moment and let μ ∈ R
be its expected value. Then it follows that

\[
VX = \sum_{j=1}^{\infty} (x_j - \mu)^2\,p_j \qquad\text{and}\qquad
VX = \int_{-\infty}^{\infty} (x - \mu)^2\,p(x)\,dx        (5.30)
\]

in the discrete and continuous case, respectively. Hereby, x_1, x_2, … are the possible
values of X and p_j = P{X = x_j} in the discrete case, while p denotes the density of X in the
continuous case.


Proof: The assertion follows directly by an application of properties (4) and (5) of
Proposition 5.1.36 to f(x) = (x − μ)². □


Before we present concrete examples, let us state and prove certain properties of the 
variance, which will simplify the calculations later on. 


Proposition 5.2.15. Assume X and Y are random variables with finite second moment.
Then the following are valid.

(i) We have

\[ VX = EX^2 - (EX)^2 .        (5.31) \]

(ii) If P{X = c} = 1 for some c ∈ R, then⁵ VX = 0.

(iii) For a, b ∈ R it follows that

\[ V(aX + b) = a^2\,VX . \]

(iv) In the case of independent X and Y one has

\[ V(X + Y) = VX + VY . \]


Proof: Let us begin with the proof of (i). With μ = EX we obtain

\[
VX = E(X - \mu)^2 = E\left[X^2 - 2\mu X + \mu^2\right] = EX^2 - 2\mu\,EX + \mu^2
   = EX^2 - 2\mu^2 + \mu^2 = EX^2 - \mu^2 .
\]

This proves (i).

To verify (ii) we use property (2) in Proposition 5.1.36. Then μ = EX = c, hence
P{X − μ = 0} = 1. Another application of property (2) leads to

\[ VX = E(X - \mu)^2 = 0 \]

as asserted.

Next we prove (iii). If a, b ∈ R, then E(aX + b) = aEX + b by the linearity of the
expected value. Consequently,

\[ V(aX + b) = E\left[aX + b - (a\,EX + b)\right]^2 = a^2\,E(X - EX)^2 = a^2\,VX . \]

Thus (iii) is valid.

To prove (iv) observe that, if μ := EX and ν := EY, then E(X + Y) = μ + ν, and hence

\[
\begin{aligned}
V(X + Y) &= E\left[(X - \mu) + (Y - \nu)\right]^2 \\
         &= E(X - \mu)^2 + 2\,E[(X - \mu)(Y - \nu)] + E(Y - \nu)^2 \\
         &= VX + 2\,E[(X - \mu)(Y - \nu)] + VY .        (5.32)
\end{aligned}
\]

By Proposition 4.1.9 the independence of X and Y implies that of X − μ and Y − ν.
Therefore, from property (6) in Proposition 5.1.36 we derive

\[
E[(X - \mu)(Y - \nu)] = E(X - \mu) \cdot E(Y - \nu) = (EX - \mu) \cdot (EY - \nu) = 0 \cdot 0 = 0 .
\]

Plugging this into eq. (5.32) completes the proof of (iv). □


5 The converse implication is also true. If VX = 0, then X is constant with probability 1. 
5.2.3 Variance of Certain Random Variables


Our first objective is to describe the variance of a random variable uniformly distrib- 
uted on a finite set. 


Proposition 5.2.16. If X is uniformly distributed on {x_1, …, x_N}, then

\[ VX = \frac{1}{N} \sum_{j=1}^{N} (x_j - \mu)^2 , \]

where μ is given by μ = \frac{1}{N} \sum_{j=1}^{N} x_j.


Proof: Because of p_j = 1/N, 1 ≤ j ≤ N, this is a direct consequence of eq. (5.30). Recall
that μ was computed in eq. (5.6). □


Example 5.2.17. Suppose X is uniformly distributed on {1, …, 6}. Then EX = 7/2, and
we get

\[
VX = \frac{\left(1 - \frac{7}{2}\right)^2 + \left(2 - \frac{7}{2}\right)^2 + \cdots + \left(6 - \frac{7}{2}\right)^2}{6} = \frac{35}{12} .
\]

Thus, when rolling a die once, the variance is given by 35/12.

Now assume that we roll the die n times. Let X_1, …, X_n be the results of the single
rolls. The X_j are independent, hence, if S_n = X_1 + ⋯ + X_n denotes the sum of the n
trials, then by (iv) in Proposition 5.2.15 it follows that

\[ VS_n = V(X_1 + \cdots + X_n) = VX_1 + \cdots + VX_n = \frac{35\,n}{12} . \]

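
The computation for the die can be reproduced exactly with rational arithmetic. The following Python sketch is an illustration added here, not part of the original text; the helper name mean_and_variance_uniform is hypothetical. It applies the formula of Proposition 5.2.16 and then property (iv) of Proposition 5.2.15.

```python
from fractions import Fraction

def mean_and_variance_uniform(values):
    """Mean and variance of the uniform distribution on the given finite set."""
    values = list(values)
    n = len(values)
    mu = Fraction(sum(values), n)
    # Average quadratic distance to the mean, each value having probability 1/n.
    var = sum((Fraction(x) - mu) ** 2 for x in values) / n
    return mu, var

mu, var = mean_and_variance_uniform(range(1, 7))
print(mu, var)  # mean and variance of a single roll

# For n independent rolls the variances add up.
n_rolls = 10
print(n_rolls * var)
```

Using Fraction instead of floating-point numbers keeps the result in the same form as in the text, so it can be compared directly with 35/12.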
The next proposition examines the variance of binomial distributed random variables. 


Proposition 5.2.18. If X is B_{n,p}-distributed, then

\[ VX = np(1 - p) .        (5.33) \]


Proof: Let X be B_{n,p}-distributed. In Example 5.1.44 we found EX² = n²p² + np(1 − p).
Moreover, EX = np by Proposition 5.1.13. Thus, from formula (5.31) we derive

\[ VX = EX^2 - (EX)^2 = n^2p^2 + np(1 - p) - (np)^2 = np(1 - p) \]

as asserted. □

Corollary 5.2.19. Binomial distributed random variables have maximal variance (with
n fixed) if p = 1/2.


Proof: The function p ↦ np(1 − p) becomes maximal for p = 1/2. In the extreme cases
p = 0 and p = 1 the variance is zero. □
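
Formula (5.33) and Corollary 5.2.19 can be checked by computing the variance directly from the binomial probabilities. The following Python sketch is an added illustration under our own naming (binomial_variance is a hypothetical helper); it uses formula (5.31) with exact fractions.

```python
from fractions import Fraction
from math import comb

def binomial_variance(n, p):
    """Variance of a B_{n,p}-distributed variable, computed as EX^2 - (EX)^2."""
    pmf = [comb(n, k) * p**k * (1 - p)**(n - k) for k in range(n + 1)]
    mean = sum(k * q for k, q in enumerate(pmf))
    second_moment = sum(k * k * q for k, q in enumerate(pmf))
    return second_moment - mean**2

n = 10
p = Fraction(1, 3)
print(binomial_variance(n, p), n * p * (1 - p))  # both should coincide

# Scan p over a grid of multiples of 1/10 and locate the maximal variance.
grid = [Fraction(j, 10) for j in range(11)]
best_p = max(grid, key=lambda q: binomial_variance(n, q))
print(best_p)
```

On the grid used here the maximum is attained at p = 1/2, and the endpoints p = 0 and p = 1 give variance zero, in accordance with the corollary.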


Next we determine the variance of Poisson distributed random variables.

Proposition 5.2.20. Let X be Pois_λ-distributed for some λ > 0. Then

\[ VX = \lambda . \]


Proof: In Example 5.1.42 we computed EX² = λ² + λ. Furthermore, by Proposition 5.1.16
we know that EX = λ. Thus, by eq. (5.31) we obtain, as asserted,

\[ VX = EX^2 - (EX)^2 = \lambda^2 + \lambda - \lambda^2 = \lambda . \]  □


Next, we compute the variance of a geometric distributed random variable.


Proposition 5.2.21. Let X be G_p-distributed for some 0 < p < 1. Then its variance equals

\[ VX = \frac{1-p}{p^2} . \]

Proof: In Example 5.1.44 we found EX² = \frac{2-p}{p^2}, and by eq. (5.14) we have EX = \frac{1}{p}.
Consequently, formula (5.31) implies

\[ VX = \frac{2-p}{p^2} - \frac{1}{p^2} = \frac{1-p}{p^2} \]

as asserted. □


Corollary 5.2.22. If X is B^-_{n,p}-distributed, then

\[ VX = n\,\frac{1-p}{p^2} . \]


Proof: Let X_1, …, X_n be independent G_p-distributed random variables. By Corollary
4.6.9 their sum X := X_1 + ⋯ + X_n is B^-_{n,p}-distributed, hence property (iv) in Proposition
5.2.15 lets us conclude that

\[ VX = V(X_1 + \cdots + X_n) = VX_1 + \cdots + VX_n = n\,\frac{1-p}{p^2} . \]  □

Interpretation: The smaller p becomes, the bigger is the variance of a geometrically or
negative binomially distributed random variable (for n fixed). This is not surprising,
because the smaller p is, the larger is the expected value, and the values of X may be
very far from 1/p (success is very unlikely).

We consider now variances of continuous random variables. Let us begin with 
uniformly distributed ones. 


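Proposition 5.2.23. Let X be uniformly distributed on an interval [α, β]. Then it follows
that

\[ VX = \frac{(\beta - \alpha)^2}{12} . \]


Proof: We know by Proposition 5.1.25 that EX = (α + β)/2. In order to apply formula
(5.31), we still have to compute the second moment EX². Here we obtain

\[
EX^2 = \frac{1}{\beta - \alpha} \int_{\alpha}^{\beta} x^2\,dx
     = \frac{1}{3} \cdot \frac{\beta^3 - \alpha^3}{\beta - \alpha}
     = \frac{\beta^2 + \alpha\beta + \alpha^2}{3} .
\]

Consequently, formula (5.31) lets us conclude that

\[
\begin{aligned}
VX = EX^2 - (EX)^2 &= \frac{\beta^2 + \alpha\beta + \alpha^2}{3} - \left(\frac{\alpha + \beta}{2}\right)^2 \\
                   &= \frac{\beta^2 + \alpha\beta + \alpha^2}{3} - \frac{\alpha^2 + 2\alpha\beta + \beta^2}{4} \\
                   &= \frac{\alpha^2 - 2\alpha\beta + \beta^2}{12} = \frac{(\beta - \alpha)^2}{12} .
\end{aligned}
\]

This completes the proof. □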


In the case of gamma distributed random variables, the following is valid.

Proposition 5.2.24. If X is Γ_{α,β}-distributed, then

\[ VX = \alpha^2 \beta . \]


Proof: Recall that EX = αβ by Proposition 5.1.26. Furthermore, in Example 5.2.10 we
evaluated EX^n for a gamma distributed X. Taking n = 2 implies

\[ EX^2 = \alpha^2 (\beta + 1)\,\beta , \]

and, hence, by eq. (5.31),

\[ VX = EX^2 - (EX)^2 = \alpha^2 (\beta + 1)\,\beta - (\alpha\beta)^2 = \alpha^2 \beta \]

as asserted. □

Corollary 5.2.25. If X is E_λ-distributed, then

\[ VX = \frac{1}{\lambda^2} . \]

Proof: Because of E_λ = Γ_{1/λ,1}, this directly follows from Proposition 5.2.24. □


Corollary 5.2.26. For a χ²_n-distributed X we have

\[ VX = 2n . \]


Proof: Let us give two alternative proofs of the assertion. The first one uses
Proposition 5.2.24 and χ²_n = Γ_{2,n/2}.

The second proof is longer, but maybe more interesting. Let X_1, …, X_n be
independent N(0, 1)-distributed random variables. Proposition 4.6.17 implies that
X := X_1² + ⋯ + X_n² is χ²_n-distributed, thus property (iv) of Proposition 5.2.15 applies
and leads to

\[ VX = VX_1^2 + \cdots + VX_n^2 = n\,VX_1^2 . \]

In Example 5.2.12 we evaluated the moments of an N(0, 1)-distributed random variable.
In particular, EX_1² = 1 and E(X_1²)² = EX_1⁴ = 3!! = 3, hence

\[ VX = n\,VX_1^2 = n\left(EX_1^4 - (EX_1^2)^2\right) = (3 - 1)\,n = 2n \]

as claimed. □
Finally we determine the variance of a normal random variable.

Proposition 5.2.27. If X is N(μ, σ²)-distributed, then it follows that

\[ VX = \sigma^2 . \]


Proof: Of course, this could be proven by computing the integral

\[ VX = \int_{-\infty}^{\infty} (x - \mu)^2\,p_{\mu,\sigma^2}(x)\,dx . \]

We prefer a different approach that avoids the calculation of integrals. Because of
Proposition 4.2.3, the random variable X may be represented as X = σX_0 + μ for a standard
normal X_0. Applying (iii) in Proposition 5.2.15 gives

\[ VX = \sigma^2\,VX_0 .        (5.34) \]

But EX_0 = 0, and by Example 5.2.12 we have EX_0² = 1, thus

\[ VX_0 = 1 - 0 = 1 . \]

Plugging this into eq. (5.34) proves VX = σ². □


Remark 5.2.28. The previous result explains why the parameter σ² of an N(μ, σ²)-
distribution is called “variance.”


5.3 Covariance and Correlation 
5.3.1 Covariance 


Suppose we know or we conjecture that two given random variables X and Y are 
dependent. The aim of this section is to introduce a quantity that measures their de- 
gree of dependence. Such a quantity should tell us whether the random variables are 
strongly or only weakly dependent. Furthermore, we want to know what kind of de- 
pendence we observe. Do larger values of X trigger larger values of Y or is it the other 
way round? To illustrate these questions let us come back to the experiment presented 
in Example 2.2.5. 


Example 5.3.1. In an urn are n balls labeled with “0” and another n balls labeled with
“1.” Choose two balls out of the urn without replacement. Let X be the number appearing
on the first ball and Y that on the second. Then X and Y are dependent (check this),
but it is intuitively clear that if n becomes larger, then their dependence diminishes.
We ask for a quantity that tells us their degree of dependence. This measure should
decrease as n increases and it should tend to zero as n → ∞.

Moreover, if X = 1 occurred, then there remained in the urn more balls with “0” 
than with “1,” and the probability of the event {Y = 0} increases. Thus, larger values 
of X make smaller values of Y more likely. 


Before we are able to introduce such a “measure of dependence,” we need some 
preparation. 


Proposition 5.3.2. If two random variables X and Y possess a finite second moment, 
then the expected value of their product X Y exists. 


Proof: We use the elementary estimate |ab| ≤ (a² + b²)/2 valid for a, b ∈ R. Thus, if ω ∈ Ω,
then

\[ |X(\omega)\,Y(\omega)| \le \frac{X(\omega)^2}{2} + \frac{Y(\omega)^2}{2} , \]

that is, we have

\[ |XY| \le \frac{X^2}{2} + \frac{Y^2}{2} .        (5.35) \]

By assumption

\[ E\left[\frac{X^2}{2} + \frac{Y^2}{2}\right] = \frac{1}{2}\left[EX^2 + EY^2\right] < \infty ; \]

consequently, because of estimate (5.35), property (7) in Proposition 5.1.36 applies and
tells us that E|XY| < ∞. Thus, E[XY] exists as asserted. □


How do we compute E[XY] for given X and Y ? In Section 4.5 we observed that the 
distribution of X + Y does not only depend on the distributions of X and Y. We have 
to know their joint distribution, that is, the distribution of the vector (X, Y). And the 
same is true for products and the expected value of the product. 


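Example 5.3.3. Let us again investigate the random variables X, Y, X′, and Y′ introduced
in Example 3.5.8. Recall that they satisfied

\[
\begin{aligned}
P\{X = 0, Y = 0\} &= \tfrac{1}{6}, & P\{X = 0, Y = 1\} &= \tfrac{1}{3}, \\
P\{X = 1, Y = 0\} &= \tfrac{1}{3}, & P\{X = 1, Y = 1\} &= \tfrac{1}{6}, \\
P\{X' = 0, Y' = 0\} &= \tfrac{1}{4}, & P\{X' = 0, Y' = 1\} &= \tfrac{1}{4}, \\
P\{X' = 1, Y' = 0\} &= \tfrac{1}{4}, & P\{X' = 1, Y' = 1\} &= \tfrac{1}{4}.
\end{aligned}
\]

Then P_X = P_{X′} as well as P_Y = P_{Y′}, but

\[
\begin{aligned}
E[XY]   &= \tfrac{1}{6}(0 \cdot 0) + \tfrac{1}{3}(1 \cdot 0) + \tfrac{1}{3}(0 \cdot 1) + \tfrac{1}{6}(1 \cdot 1) = \tfrac{1}{6} \quad\text{and} \\
E[X'Y'] &= \tfrac{1}{4}(0 \cdot 0) + \tfrac{1}{4}(1 \cdot 0) + \tfrac{1}{4}(0 \cdot 1) + \tfrac{1}{4}(1 \cdot 1) = \tfrac{1}{4} .
\end{aligned}
\]

This example tells us that we have to know the joint distribution in order to compute
E[XY]. The knowledge of the marginal distributions does not suffice.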


To evaluate E[XY] we need the following two-dimensional generalization of formulas 
(5.21) and (5.22). 


Proposition 5.3.4. Let X and Y be two random variables and let f : R² → R be some
function.

1. Suppose X and Y are discrete with values in {x_1, x_2, …} and in {y_1, y_2, …}. Set
   p_{ij} = P{X = x_i, Y = y_j}. If

\[ E|f(X, Y)| = \sum_{i,j=1}^{\infty} |f(x_i, y_j)|\,p_{ij} < \infty ,        (5.36) \]

   then Ef(X, Y) exists and can be computed by

\[ Ef(X, Y) = \sum_{i,j=1}^{\infty} f(x_i, y_j)\,p_{ij} . \]

2. Let f : R² → R be continuous⁶. If p : R² → R is the joint density of (X, Y) (recall
   Definition 3.5.15), then

\[ E|f(X, Y)| = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} |f(x, y)|\,p(x, y)\,dx\,dy < \infty        (5.37) \]

   implies the existence of Ef(X, Y), which can be evaluated by

\[ Ef(X, Y) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} f(x, y)\,p(x, y)\,dx\,dy .        (5.38) \]


Remark 5.3.5. The previous formulas extend easily to higher dimensions. That is, if
X = (X_1, …, X_n) is an n-dimensional random vector with (joint) distribution density
p : R^n → R, then for continuous⁷ f : R^n → R one has

\[ Ef(X) = Ef(X_1, \ldots, X_n) = \int_{\mathbb{R}^n} f(x_1, \ldots, x_n)\,p(x_1, \ldots, x_n)\,dx_1 \cdots dx_n , \]

provided the integral exists. The case of discrete X_1, …, X_n is treated in a similar way.
If X maps into the finite or countably infinite set D ⊂ R^n, then

\[ Ef(X) = Ef(X_1, \ldots, X_n) = \sum_{x \in D} f(x)\,P\{X = x\} . \]


6 In fact we need only a measurability in the sense of 4.1.1, but this time for functions f from R² to R.
For our purposes “continuity” of f suffices.
7 Cf. the remark for n = 2.
If we apply Proposition 5.3.4 with f : (x, y) ↦ x · y, then we obtain the following
formulas for the evaluation of E[XY]. Hereby, we assume that conditions (5.36) or (5.37) are
satisfied.


Corollary 5.3.6. In the notation of Proposition 5.3.4 the following are valid:

\[
E[XY] = \sum_{i,j=1}^{\infty} (x_i \cdot y_j)\,p_{ij} \qquad\text{and}\qquad
E[XY] = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} (x \cdot y)\,p(x, y)\,dx\,dy
\]

in the discrete and in the continuous case, respectively.


After all these preparations we are now in position to introduce the covariance of two 
random variables. 


Definition 5.3.7. Let X and Y be two random variables with finite second moments.
Setting μ = EX and ν = EY, the covariance of X and Y is defined as

\[ \mathrm{Cov}(X, Y) = E[(X - \mu)(Y - \nu)] . \]


Remark 5.3.8. Apply Corollary 5.2.9 and Proposition 5.3.2 to see that the covariance is
well-defined for random variables with finite second moment. Furthermore, in view of
Proposition 5.3.4, the covariance may be computed as

\[ \mathrm{Cov}(X, Y) = \sum_{i,j=1}^{\infty} (x_i - \mu)(y_j - \nu)\,p_{ij} \]

in the discrete case (recall that p_{ij} = P{X = x_i, Y = y_j}), and as

\[ \mathrm{Cov}(X, Y) = \int_{-\infty}^{\infty} \int_{-\infty}^{\infty} (x - \mu)(y - \nu)\,p(x, y)\,dx\,dy \]

in the continuous one.


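Example 5.3.9. Let us once more consider the random variables X, Y, X′, and Y′ in
Example 3.5.8 or Example 5.3.3, respectively. Each of the four random variables has
the expected value 1/2. Therefore, we obtain

\[
\begin{aligned}
\mathrm{Cov}(X, Y) &= \tfrac{1}{6}\left(0 - \tfrac{1}{2}\right)\left(0 - \tfrac{1}{2}\right) + \tfrac{1}{3}\left(1 - \tfrac{1}{2}\right)\left(0 - \tfrac{1}{2}\right) \\
                   &\quad + \tfrac{1}{3}\left(0 - \tfrac{1}{2}\right)\left(1 - \tfrac{1}{2}\right) + \tfrac{1}{6}\left(1 - \tfrac{1}{2}\right)\left(1 - \tfrac{1}{2}\right) = -\tfrac{1}{12} ,
\end{aligned}
\]

while

\[
\begin{aligned}
\mathrm{Cov}(X', Y') &= \tfrac{1}{4}\left(0 - \tfrac{1}{2}\right)\left(0 - \tfrac{1}{2}\right) + \tfrac{1}{4}\left(1 - \tfrac{1}{2}\right)\left(0 - \tfrac{1}{2}\right) \\
                     &\quad + \tfrac{1}{4}\left(0 - \tfrac{1}{2}\right)\left(1 - \tfrac{1}{2}\right) + \tfrac{1}{4}\left(1 - \tfrac{1}{2}\right)\left(1 - \tfrac{1}{2}\right) = 0 .
\end{aligned}
\]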

The following proposition summarizes the main properties of the covariance. 


Proposition 5.3.10. Let X and Y be random variables with finite second moments. Then
the following are valid.

(1) Cov(X, Y) = Cov(Y, X).

(2) Cov(X, X) = VX.

(3) The covariance is bilinear, that is, for X_1, X_2 and real numbers a_1 and a_2

\[ \mathrm{Cov}(a_1 X_1 + a_2 X_2, Y) = a_1\,\mathrm{Cov}(X_1, Y) + a_2\,\mathrm{Cov}(X_2, Y) \]

    and, analogously,

\[ \mathrm{Cov}(X, b_1 Y_1 + b_2 Y_2) = b_1\,\mathrm{Cov}(X, Y_1) + b_2\,\mathrm{Cov}(X, Y_2) \]

    for random variables Y_1, Y_2 and real numbers b_1, b_2.

(4) The covariance may also be evaluated by

\[ \mathrm{Cov}(X, Y) = E[XY] - (EX)(EY) .        (5.39) \]

(5) Cov(X, Y) = 0 for independent X and Y.

Proof: Properties (1) and (2) follow directly from the definition of the covariance.

Let us verify (3). Setting μ_1 = EX_1 and μ_2 = EX_2, the linearity of the expected value
implies

\[ E(a_1 X_1 + a_2 X_2) = a_1 \mu_1 + a_2 \mu_2 . \]

Hence, if ν = EY, then

\[
\begin{aligned}
\mathrm{Cov}(a_1 X_1 + a_2 X_2, Y) &= E\left[\left(a_1 (X_1 - \mu_1) + a_2 (X_2 - \mu_2)\right)(Y - \nu)\right] \\
  &= a_1\,E[(X_1 - \mu_1)(Y - \nu)] + a_2\,E[(X_2 - \mu_2)(Y - \nu)] \\
  &= a_1\,\mathrm{Cov}(X_1, Y) + a_2\,\mathrm{Cov}(X_2, Y) .
\end{aligned}
\]

This proves the first part of (3). The second part can be proven in the same way or one
uses Cov(X, Y) = Cov(Y, X) and the first part of (3).

Next we prove eq. (5.39). With μ = EX and ν = EY by

\[ (X - \mu)(Y - \nu) = XY - \mu Y - \nu X + \mu\nu , \]

we get that

\[
\mathrm{Cov}(X, Y) = E[XY - \mu Y - \nu X + \mu\nu] = E[XY] - \mu\,EY - \nu\,EX + \mu\nu = E[XY] - \mu\nu .
\]

This proves (4) by the definition of μ and ν.

Finally, we verify (5). If X and Y are independent, then by Proposition 4.1.9 this
is also true for X − μ and Y − ν. Thus, property (6) of Proposition 5.1.36 applies and
leads to

\[ \mathrm{Cov}(X, Y) = E[(X - \mu)(Y - \nu)] = E(X - \mu)\,E(Y - \nu) = [EX - \mu]\,[EY - \nu] = 0 . \]

Therefore, the proof is completed. □


Remark 5.3.11. Quite often the computation of Cov(X, Y) can be simplified by the use
of eq. (5.39). For example, consider X and Y in Example 3.5.8. In Example 5.3.3 we
found E[XY] = 1/6. Since EX = EY = 1/2, by eq. (5.39) we immediately get

\[ \mathrm{Cov}(X, Y) = \frac{1}{6} - \frac{1}{4} = -\frac{1}{12} . \]

We obtained the same result in Example 5.3.9 with slightly more effort.
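
Both ways of computing the covariance are easy to compare on the two joint distributions of Example 5.3.3. The following Python sketch is an added illustration, not part of the original text; the function name covariance and the dictionary layout (pairs (x, y) mapped to probabilities) are our own assumptions.

```python
from fractions import Fraction

def covariance(joint):
    """Return Cov(X, Y) computed by the definition and by formula (5.39).

    `joint` maps pairs (x, y) to the probabilities P{X = x, Y = y}.
    """
    ex = sum(x * p for (x, y), p in joint.items())
    ey = sum(y * p for (x, y), p in joint.items())
    exy = sum(x * y * p for (x, y), p in joint.items())
    by_definition = sum((x - ex) * (y - ey) * p for (x, y), p in joint.items())
    by_formula = exy - ex * ey
    return by_definition, by_formula

# Joint distributions of (X, Y) and (X', Y') from Example 5.3.3.
joint_xy = {(0, 0): Fraction(1, 6), (0, 1): Fraction(1, 3),
            (1, 0): Fraction(1, 3), (1, 1): Fraction(1, 6)}
joint_primed = {(x, y): Fraction(1, 4) for x in (0, 1) for y in (0, 1)}

print(covariance(joint_xy))      # both entries should equal -1/12
print(covariance(joint_primed))  # both entries should equal 0
```

Although the marginal distributions of the two pairs coincide, the covariances differ, which again shows that the joint distribution is needed.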


Property (5) in Proposition 5.3.10 is of special interest. It asserts Cov(X, Y) = 0 for inde- 
pendent X and Y. One may ask now whether this characterizes independent random 
variables. More precisely, are the random variables X and Y independent if and only if 
Cov(X, Y) =0? 

The answer is negative as the next example shows. 


Example 5.3.12. The joint distribution of X and Y is given by the following table:

    Y\X  |  -1      0      1   |
    -----+---------------------+------
     -1  | 1/10   1/10   1/10  | 3/10
      0  | 1/10   2/10   1/10  | 2/5
      1  | 1/10   1/10   1/10  | 3/10
    -----+---------------------+------
         | 3/10   2/5    3/10  |

Of course, EX = EY = 0 and, moreover,

\[ E[XY] = \frac{1}{10}\left((-1)(-1) + (-1)(+1) + (+1)(-1) + (+1)(+1)\right) = 0 , \]

which by eq. (5.39) implies Cov(X, Y) = 0. On the other hand, Proposition 3.6.9
tells us that X and Y are not independent. For example, P{X = 0, Y = 0} = 1/5 while
P{X = 0} P{Y = 0} = 4/25.
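
The two claims of this example, zero covariance and lack of independence, can be verified mechanically. The following Python sketch is an added illustration; the helper marginal and the dictionary representation of the table are assumptions made here and do not appear in the original text.

```python
from fractions import Fraction
from itertools import product

# Joint distribution of Example 5.3.12: keys are (x, y), values are P{X = x, Y = y}.
joint = {(x, y): Fraction(1, 10) for x, y in product((-1, 0, 1), repeat=2)}
joint[(0, 0)] = Fraction(2, 10)

def marginal(joint, index):
    """Marginal distribution of one coordinate (index 0 for X, index 1 for Y)."""
    result = {}
    for point, prob in joint.items():
        result[point[index]] = result.get(point[index], Fraction(0)) + prob
    return result

p_x, p_y = marginal(joint, 0), marginal(joint, 1)
e_x = sum(x * p for x, p in p_x.items())
e_y = sum(y * p for y, p in p_y.items())
e_xy = sum(x * y * p for (x, y), p in joint.items())
cov = e_xy - e_x * e_y  # formula (5.39)

# X and Y are independent iff every joint probability factorizes.
independent = all(joint[(x, y)] == p_x[x] * p_y[y] for (x, y) in joint)
print(cov, independent)
```

The covariance vanishes while the factorization test fails, for example at the point (0, 0), where 1/5 differs from 4/25.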


Example 5.3.12 shows that Cov(X, Y) = 0 is in general weaker than the independence 
of X and Y. Therefore, the following definition makes sense. 


Definition 5.3.13. Two random variables X and Y satisfying Cov(X, Y) = 0 are said
to be uncorrelated. Otherwise, if Cov(X, Y) ≠ 0, then X and Y are correlated.

More generally, a sequence X_1, …, X_n of random variables is called (pairwise)
uncorrelated, if Cov(X_i, X_j) = 0 whenever i ≠ j.


Using this notation, property (5) in Proposition 5.3.10 may now be formulated in the
following way:

    X and Y independent  ⟹  X and Y uncorrelated


Example 5.3.14. Let A, B ∈ 𝒜 be two events in a probability space (Ω, 𝒜, P) and let
1_A and 1_B be their indicator functions as introduced in Definition 3.6.14. How can we
compute Cov(1_A, 1_B)?

Answer: Since E1_A = P(A), we get

\[
\mathrm{Cov}(1_A, 1_B) = E[1_A 1_B] - (E1_A)(E1_B) = E1_{A \cap B} - P(A)\,P(B) = P(A \cap B) - P(A)\,P(B) .
\]

This tells us that 1_A and 1_B are uncorrelated if and only if the events A and B are
independent. But as we saw in Proposition 3.6.15, this happens if and only if the random
variables 1_A and 1_B are independent. In other words, two indicator functions are
independent if and only if they are uncorrelated.


Finally we consider the covariance of two continuous random variables. 


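Example 5.3.15. Suppose a random vector (X, Y) is uniformly distributed on the unit
ball of R². Then the joint density of (X, Y) is given by

\[
p(x, y) = \begin{cases} \dfrac{1}{\pi} & : x^2 + y^2 \le 1 \\[4pt] 0 & : x^2 + y^2 > 1 . \end{cases}
\]

We proved in Example 3.5.19 that X and Y possess the distribution densities

\[
p_X(x) = \begin{cases} \dfrac{2}{\pi}\,(1 - x^2)^{1/2} & : |x| \le 1 \\[4pt] 0 & : |x| > 1 \end{cases}
\qquad\text{and}\qquad
p_Y(y) = \begin{cases} \dfrac{2}{\pi}\,(1 - y^2)^{1/2} & : |y| \le 1 \\[4pt] 0 & : |y| > 1 . \end{cases}
\]

The function y ↦ y(1 − y²)^{1/2} is odd. Consequently, because we integrate over an
interval symmetric around the origin,

\[ EX = EY = \frac{2}{\pi} \int_{-1}^{1} y\,(1 - y^2)^{1/2}\,dy = 0 . \]

By the same argument we obtain

\[
E[XY] = \frac{1}{\pi} \iint_{\{x^2 + y^2 \le 1\}} (x \cdot y)\,dx\,dy
      = \frac{1}{\pi} \int_{-1}^{1} y \left[ \int_{-\sqrt{1 - y^2}}^{\sqrt{1 - y^2}} x\,dx \right] dy = 0 ,
\]

and these two assertions imply Cov(X, Y) = 0. Hence, X and Y are uncorrelated, but as
we already observed in Example 3.6.19, they are not independent.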


5.3.2 Correlation Coefficient 


The question arises whether or not the covariance is the quantity that we are looking
for, that is, which measures the degree of dependence. The answer is only partially
affirmative. Why? Suppose X and Y are dependent. If a is a nonzero real number, then
a natural demand is that the degree of dependence between X and Y should be the
same as that between aX and Y. But

\[ \mathrm{Cov}(aX, Y) = a\,\mathrm{Cov}(X, Y) , \]

thus, if a ≠ 1, then the measure of dependence would increase or decrease. To
overcome this drawback, we normalize the covariance in the following way.


Definition 5.3.16. Let X and Y be random variables with finite second moments.
Furthermore, we assume that neither X nor Y is constant with probability 1, that
is, we have VX > 0 and VY > 0. Then the quotient

\[ \rho(X, Y) := \frac{\mathrm{Cov}(X, Y)}{(VX)^{1/2}\,(VY)^{1/2}}        (5.40) \]

is called correlation coefficient of X and Y.
To verify a crucial property of the correlation coefficient we need the following version
of the Cauchy–Schwarz inequality.


Proposition 5.3.17 (Cauchy–Schwarz inequality). For any two random variables X and
Y with finite second moments it follows that

\[ |E(XY)| \le (EX^2)^{1/2}\,(EY^2)^{1/2} .        (5.41) \]


Proof: By property (8) of Proposition 5.1.36 we have

\[ 0 \le E(|X| - \lambda |Y|)^2 = EX^2 - 2\lambda\,E|XY| + \lambda^2\,EY^2        (5.42) \]

for any λ ∈ R. To proceed further, we have to assume⁸ EX² > 0 and EY² > 0. The latter
assumption allows us to choose λ as

\[ \lambda := \frac{(EX^2)^{1/2}}{(EY^2)^{1/2}} . \]

If we apply inequality (5.42) with this λ, then we obtain

\[
0 \le EX^2 - 2\,\frac{(EX^2)^{1/2}}{(EY^2)^{1/2}}\,E|XY| + EX^2
  = 2\,EX^2 - 2\,\frac{(EX^2)^{1/2}}{(EY^2)^{1/2}}\,E|XY| ,
\]

which easily implies (recall that we assumed EX² > 0)

\[ E|XY| \le (EX^2)^{1/2}\,(EY^2)^{1/2} . \]

To complete the proof, we use Corollary 5.1.39 and get

\[ |E(XY)| \le E|XY| \le (EX^2)^{1/2}\,(EY^2)^{1/2} \]

as asserted. □
Corollary 5.3.18. The correlation coefficient satisfies

\[ -1 \le \rho(X, Y) \le 1 . \]


Proof: Let as before μ = EX and ν = EY. Applying inequality (5.41) to X − μ and Y − ν
leads to

\[
|\mathrm{Cov}(X, Y)| = |E(X - \mu)(Y - \nu)| \le \left(E(X - \mu)^2\right)^{1/2} \left(E(Y - \nu)^2\right)^{1/2}
  = (VX)^{1/2}\,(VY)^{1/2} ,
\]

or, equivalently,

\[ -(VX)^{1/2}\,(VY)^{1/2} \le \mathrm{Cov}(X, Y) \le (VX)^{1/2}\,(VY)^{1/2} . \]

By the definition of ρ(X, Y) given in eq. (5.40), this implies −1 ≤ ρ(X, Y) ≤ 1 as asserted. □


8 The Cauchy–Schwarz inequality remains valid for EX² = 0 or EY² = 0. In this case it follows that
P{X = 0} = 1 or P{Y = 0} = 1, hence P{XY = 0} = 1 and E[XY] = 0.


Interpretation: For uncorrelated X and Y we have ρ(X, Y) = 0. In particular, this is valid
if X and Y are independent. On the contrary, ρ(X, Y) ≠ 0 tells us that X and Y are
dependent. Thereby, values near to zero correspond to weak dependence, while ρ(X, Y)
near to 1 or −1 indicates a strong dependence. The strongest possible dependence is
when Y = aX for some a ≠ 0. Then ρ(X, Y) = 1 if a > 0 while ρ(X, Y) = −1 for a < 0.


Definition 5.3.19. Two random variables X and Y are said to be positively correlated
if ρ(X, Y) > 0. In the case that ρ(X, Y) < 0, they are said to be negatively
correlated.


Interpretation: X and Y are positively correlated, provided that larger (or smaller) val- 
ues of X make larger (or smaller) values of Y more likely. This does not mean that 
a larger X-value always implies a larger Y-value. Only that the probability for those 
larger values increases. And in the same way, if X and Y are negatively correlated, 
then larger values of X make smaller Y-values more likely. Let us explain this with 
two typical examples. Choose at random a person ω in the audience. Let X(ω) be his
height and Y(ω) his weight. Then X and Y will surely be positively correlated. But
this does not necessarily mean that each taller person has a bigger weight. Another 
example of negatively correlated random variables could be as follows: X is the aver- 
age number of cigarettes that a randomly chosen person smokes per day and Y is his 
lifetime. 


Example 5.3.20. Let us come back to Example 5.3.1: in an urn are n balls labeled with
“0” and n labeled with “1.” One chooses two balls without replacement. Then X is the
value of the first ball, Y that of the second. How does the correlation coefficient of X
and Y depend on n?

Answer: The joint distribution of X and Y is given by the following table:

    Y\X  |       0              1
    -----+------------------------------
      0  | (n-1)/(4n-2)     n/(4n-2)
      1  |   n/(4n-2)     (n-1)/(4n-2)

Direct computations show EX = EY = 1/2 and VX = VY = 1/4. Moreover, it easily
follows E[XY] = (n − 1)/(4n − 2), hence

\[ \mathrm{Cov}(X, Y) = \frac{n-1}{4n-2} - \frac{1}{4} = \frac{-1}{8n-4} , \]

and the correlation coefficient equals

\[ \rho(X, Y) = \frac{\frac{-1}{8n-4}}{\sqrt{\frac{1}{4}}\,\sqrt{\frac{1}{4}}} = \frac{-1}{2n-1} . \]

If n → ∞, then ρ(X, Y) is of order −1/(2n). Hence, if n is large, then the random variables X
and Y are “almost” uncorrelated.

Since ρ(X, Y) < 0, the two random variables are negatively correlated. Why? This
was already explained in Example 5.3.1: an occurrence of X = 1 makes Y = 0 more
likely, while the occurrence of X = 0 increases the likelihood of Y = 1. Some words
about the case n = 1. Here Y is completely determined by the value of X, expressed by
ρ(X, Y) = −1.
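
The dependence of ρ(X, Y) on n can also be tabulated directly from the urn model. The following Python sketch is an added illustration (the function name urn_correlation is hypothetical); it builds the joint distribution by the multiplication rule and uses that VX = VY, so that the normalizing factor (VX)^{1/2}(VY)^{1/2} equals VX and no square roots are needed.

```python
from fractions import Fraction

def urn_correlation(n):
    """Correlation of the labels of two balls drawn without replacement
    from an urn with n balls labeled 0 and n balls labeled 1."""
    total = 2 * n
    joint = {}
    for x in (0, 1):
        for y in (0, 1):
            first = Fraction(n, total)                 # P{X = x}
            remaining = n - 1 if y == x else n         # balls with label y left
            joint[(x, y)] = first * Fraction(remaining, total - 1)

    ex = sum(x * p for (x, y), p in joint.items())
    ey = sum(y * p for (x, y), p in joint.items())
    vx = sum(x * x * p for (x, y), p in joint.items()) - ex**2
    vy = sum(y * y * p for (x, y), p in joint.items()) - ey**2
    cov = sum(x * y * p for (x, y), p in joint.items()) - ex * ey
    assert vx == vy  # by symmetry, so sqrt(vx) * sqrt(vy) == vx
    return cov / vx

for n in (1, 2, 5, 50):
    print(n, urn_correlation(n), Fraction(-1, 2 * n - 1))
```

For each n the two printed fractions should coincide, and they tend to zero as n grows, in line with the discussion above.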


5.4 Problems 


Problem 5.1.

1. Put successively and independently of each other n particles into N boxes.
   Thereby, each box is equally likely. How many boxes remain empty on average?
   Hint: Define random variables X_1, …, X_N as follows: set X_i = 1 if box i remains
   empty and X_i = 0 otherwise.

2. Fifty persons write randomly (according to the uniform distribution), and
   independently of each other, one of the 26 letters in the alphabet on a sheet of paper.
   On average, how many different letters appear?


Problem 5.2. Let (Ω, 𝒜, P) be a probability space. Given (not necessarily disjoint)
events A_1, …, A_n in 𝒜 and real numbers α_1, …, α_n, define X : Ω → R by⁹

\[ X := \sum_{j=1}^{n} \alpha_j\,1_{A_j} . \]

1. Why is X a random variable?
2. Prove

\[
EX = \sum_{j=1}^{n} \alpha_j\,P(A_j) \qquad\text{and}\qquad
VX = \sum_{i,j=1}^{n} \alpha_i \alpha_j \left[P(A_i \cap A_j) - P(A_i)\,P(A_j)\right] .
\]

   How does VX simplify for independent events A_1, …, A_n?


9 For the definition of indicator functions 1_A, see eq. (3.20).
Problem 5.3. Suppose a fair “die” has k faces labeled by the numbers from 1 to k.

1. How often does one have to roll the die on average before the first “1” shows up?

2. Suppose one rolls the die exactly k times. Let p_k be the probability that “1” appears
   exactly once and q_k the probability that “1” shows up at least once. Compute
   p_k and q_k and determine their behavior as k → ∞, that is, find lim_{k→∞} p_k and
   lim_{k→∞} q_k.


Problem 5.4.
1. Let X be a random variable with values in N_0 = {0, 1, 2, …}. Prove that

\[ EX = \sum_{k=1}^{\infty} P\{X \ge k\} . \]

2. Suppose now that X is continuous with P{X ≥ 0} = 1. Verify

\[ \sum_{k=1}^{\infty} P\{X \ge k\} \le EX \le 1 + \sum_{k=1}^{\infty} P\{X \ge k\} . \]


Problem 5.5. Let X be an N_0-valued random variable with

\[ P\{X = k\} = q^{-k} , \qquad k = 1, 2, \ldots \]

for some q ≥ 2.

(a) Why do we have to suppose q ≥ 2, although \sum_{k=1}^{\infty} q^{-k} < ∞ for q > 1?
(b) Determine P{X = 0}.
(c) Compute EX by the formula in Problem 5.4.
(d) Compute EX directly by EX = \sum_{k=0}^{\infty} k\,P\{X = k\}.


Problem 5.6. Two independent random variables X and Y with third moment satisfy
EX = EY = 0. Prove that then

\[ E(X + Y)^3 = EX^3 + EY^3 . \]


Problem 5.7. A random variable X is Pois_λ-distributed for some λ > 0. Evaluate


fa (Xx 
(=) and (=). 


Problem 5.8. In a lottery, 6 of 49 numbers are chosen at random. Let X be the largest
of the 6 numbers. Show that

\[ EX = \frac{6 \cdot 43!}{49!} \sum_{k=6}^{49} k(k-1)(k-2)(k-3)(k-4)(k-5) \approx 42.8571 . \]

Evaluate EX if X is the smallest number of the 6 chosen.
Hint: Either one modifies the calculations for the maximal value suitably or one
reduces the second problem to the first one by an easy algebraic operation.


Problem 5.9. A fair coin is labeled by “0” on one side and with “1” on the other one.
Toss it four times. Let X be the sum of the two first tosses and Y be the sum of all
four ones. Determine the joint distribution of X and Y. Evaluate Cov(X, Y) as well
as ρ(X, Y).


Problem 5.10. In an urn are five balls, two labeled by “0” and three by “1.” Choose
two balls without replacement. Let X be the number on the first ball and Y that on the
second.

1. Determine the distribution of the random vector (X, Y) as well as its marginal
   distributions.

2. Compute ρ(X, Y).

3. Which distribution does X + Y possess?


Problem 5.11. Among 40 students are 30 men and 10 women. Also, 25 of the 30 men
and 8 of the 10 women passed an exam successfully. Choose randomly, according to
the uniform distribution, one of the 40 students. Let X = 0 if the chosen person is a
man, and X = 1 if it is a woman. Furthermore, set Y = 0 if the person failed the exam,
and Y = 1 if she or he passed.

1. Find the joint distribution of X and Y.

2. Are X and Y independent? If not, evaluate Cov(X, Y).

3. Are X and Y negatively or positively correlated? What does it express, when X and
   Y are positively or negatively correlated?


Problem 5.12. Let (Ω, 𝒜, P) be a probability space. Prove for any two events A and B
in 𝒜 the estimate

\[ |P(A \cap B) - P(A)\,P(B)| \le \frac{1}{4} . \]

Is it possible to improve the upper bound 1/4?


Problem 5.13. (Problem of Luca Pacioli in 1494; the first correct solution was found 
by Blaise Pascal in 1654) Two players, say A and B, are playing a fair game consisting 
of several rounds. The first player who wins six rounds wins the game and the stakes 
of 20 Taler that have been bet throughout the game. However, one day the game is 
interrupted and must be stopped. If player A has won five rounds and player B has 
won three rounds, how should the stakes be divided fairly among the players? 
Problem 5.14. In Example 5.1.46 we computed the average number of necessary
purchases to get all n pictures. Let m be an integer with 1 ≤ m ≤ n. How many purchases
are necessary on average to possess m of the n pictures?
For n even choose m = n/2 and for n odd take m = (n − 1)/2. Let M_n be the average
number of purchases to get m pictures, that is, to get half of the pictures. Determine

\[ \lim_{n \to \infty} \frac{M_n}{n} . \]

Hint: Use eq. (5.26).


Problem 5.15. Compute E|X|^{2n+1} for a standard normal distributed X and n = 0, 1, ….


Problem 5.16. Suppose X has the density

\[
p(x) = \begin{cases} 0 & : x < 1 \\ c_\alpha\,x^{\alpha} & : x \ge 1 \end{cases}
\]

for some α < −1.

1. Determine c_α such that p is a density.
2. For which n ≥ 1 does X possess an nth moment?


Problem 5.17. Let U be uniformly distributed on an interval [α, β]. Show that for n ≥ 1

\[ EU^n = \frac{\beta^n + \alpha\beta^{n-1} + \cdots + \alpha^{n-1}\beta + \alpha^n}{n+1} . \]


Problem 5.18. Let X_1, …, X_n be random variables with finite second moment and
with EX_j = 0. Show that

\[
E[X_1 + \cdots + X_n]^2 = \sum_{i,j=1}^{n} \mathrm{Cov}(X_i, X_j)
  = \sum_{j=1}^{n} VX_j + 2 \sum_{1 \le i < j \le n} \mathrm{Cov}(X_i, X_j) .
\]


Problem 5.19. Show

\[ EX = \frac{nM}{N} \]

for a hypergeometric distributed random variable X with

\[ P\{X = m\} = \frac{\binom{M}{m} \binom{N-M}{n-m}}{\binom{N}{n}} . \]


Problem 5.20. Let X be N(0, 1)-distributed. Determine VX? and VX".
Problem 5.21. Given a non-negative random variable X, define φ_X from [0, ∞) to
[0, ∞] by φ_X(t) = E t^X. Then φ_X is called generating function of X (see [GSO1b],
Section 5.1).

(1) Suppose X has values in N_0. Show that, if t > 0, then this “new” definition of the
    generating function coincides with the one given in Problem 4.2.

(2) Let X_1, …, X_n be independent and non-negative. For α_j > 0, 1 ≤ j ≤ n, let

\[ X = \alpha_1 X_1 + \cdots + \alpha_n X_n . \]

    Prove

\[ \varphi_X(t) = \varphi_{X_1}(t^{\alpha_1}) \cdots \varphi_{X_n}(t^{\alpha_n}) . \]

(3) Find φ_X for an exponentially distributed X.


6 Normally Distributed Random Vectors 


6.1 Representation and Density 


In Example 3.4.3 we considered a two-dimensional random vector $(X_1, X_2)$, where $X_1$
was the height of a randomly chosen person and $X_2$ was his weight. From experience
and in view of the central limit theorem (cf. Section 7), it is quite reasonable to assume
that $X_1$ and $X_2$ are normally distributed. Suppose we are able to determine their
expected values and their variances. However, this is not sufficient to describe the
experiment. Why? The random variables $X_1$ and $X_2$ are surely dependent, and the most
interesting problem is to describe their degree of dependence. This cannot be done
based only on the knowledge of their distributions. What we really need to know is
their joint distribution. Therefore, we not only have to suppose $X_1$ and $X_2$ to be normal,
but the generated vector $(X_1, X_2)$ has to be as well.

But what does it mean that a random vector is normally distributed? This section
is devoted to answering this and related questions.

Let us first recall the univariate case, investigated in Example 4.2.2 and in the
subsequent Proposition 4.2.3. The main observation was that a random variable Y is
normally distributed if and only if it may be written as

$$Y = aX + \mu \tag{6.1}$$

for some $a \ne 0$, $\mu \in \mathbb{R}$, and a standard normal random variable X.

Let now $Y = (Y_1, \ldots, Y_n)$ be an n-dimensional random vector. We want to represent
it in the same way as Y in eq. (6.1). Consequently, we have to replace X by a multivariate
standard normal vector and the function $x \mapsto ax + \mu$ by a suitable mapping from
$\mathbb{R}^n$ to $\mathbb{R}^n$. But which kind of mapping should this be, and what is an n-dimensional
standard normal vector?

Let us begin by answering the second question. To this end, recall the definition
of the multivariate standard normal distribution $\mathcal{N}(0, 1)^{\otimes n}$ introduced in Definition
1.9.16. This probability measure on $(\mathbb{R}^n, \mathcal{B}(\mathbb{R}^n))$ was given by

$$\mathcal{N}(0, 1)^{\otimes n}(B) = \frac{1}{(2\pi)^{n/2}} \int_B e^{-|x|^2/2}\, dx
= \frac{1}{(2\pi)^{n/2}} \underbrace{\int \cdots \int}_{B} e^{-(x_1^2 + \cdots + x_n^2)/2}\, dx_n \cdots dx_1$$

with $B \in \mathcal{B}(\mathbb{R}^n)$. Thus, a random vector X should be standard normal distributed
whenever its probability distribution is $\mathcal{N}(0, 1)^{\otimes n}$. Let us formulate this as a
definition.

Definition 6.1.1. A random vector $X = (X_1, \ldots, X_n)$ is standard normally (distributed)
if its probability distribution satisfies $\mathbb{P}_X = \mathcal{N}(0, 1)^{\otimes n}$.


To make this definition more descriptive, let us state some equivalent properties. 


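Proposition 6.1.2. For a random vector $X = (X_1, \ldots, X_n)$ the following are equivalent:
1. X is standard normal.
2. If $B \in \mathcal{B}(\mathbb{R}^n)$, then

$$\mathbb{P}\{X \in B\} = \frac{1}{(2\pi)^{n/2}} \int_B e^{-|x|^2/2}\, dx.$$

3. The coordinate mappings $X_1, \ldots, X_n$ are (univariate) standard normal distributed
and independent. That is, for all $t_j \in \mathbb{R}$, $1 \le j \le n$,

$$\mathbb{P}\{X_1 \le t_1, \ldots, X_n \le t_n\} = \mathbb{P}\{X_1 \le t_1\} \cdots \mathbb{P}\{X_n \le t_n\}
= \Big(\frac{1}{\sqrt{2\pi}} \int_{-\infty}^{t_1} e^{-x_1^2/2}\, dx_1\Big) \cdots \Big(\frac{1}{\sqrt{2\pi}} \int_{-\infty}^{t_n} e^{-x_n^2/2}\, dx_n\Big).$$


Proof: Taking into account the definition of $\mathcal{N}(0, 1)^{\otimes n}$, this is an immediate
consequence of Propositions 3.6.5 and 3.6.18. Compare also the considerations in
Example 3.6.22. □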


An adequate substitute for $x \mapsto ax + \mu$ in representation (6.1) is still undetermined.
Which mappings in $\mathbb{R}^n$ should be considered?

Observe that $x \mapsto ax + \mu$ is affine linear from $\mathbb{R}$ to $\mathbb{R}$. The counterpart in $\mathbb{R}^n$ is of the
form $x \mapsto Ax + \mu$, where A is a linear mapping in $\mathbb{R}^n$ and $\mu \in \mathbb{R}^n$. Linear mappings in $\mathbb{R}^n$
are described by n × n matrices $A = (\alpha_{ij})_{i,j=1}^{n}$ and act as follows:

$$Ax = \Big(\sum_{j=1}^{n} \alpha_{1j} x_j, \ldots, \sum_{j=1}^{n} \alpha_{nj} x_j\Big), \qquad x = (x_1, \ldots, x_n) \in \mathbb{R}^n.$$

Consequently, the suitable generalization of $x \mapsto ax + \mu$ is the mapping $x \mapsto Ax + \mu$
with an n × n matrix A and $\mu \in \mathbb{R}^n$. The condition $a \ne 0$ transfers to $\det(A) \ne 0$ or,
equivalently, A has to be regular, that is, the generated mapping is one-to-one from
$\mathbb{R}^n$ onto $\mathbb{R}^n$. Here and in the sequel we will use results and notations as presented in
Section A.4.

Now we are in a position to define normally (distributed) random vectors.

Definition 6.1.3. A random vector Y is said to be normally distributed (or simply,
normal) provided there exists a regular n × n matrix A and a vector $\mu \in \mathbb{R}^n$ such that

$$Y = AX + \mu \tag{6.2}$$

for some standard normal X.


Remark 6.1.4. Let us reformulate Definition 6.1.3 due to its importance. A random vector
$Y = (Y_1, \ldots, Y_n)$ is normal if and only if there exists a regular matrix $A = (\alpha_{ij})_{i,j=1}^{n}$
and a vector $\mu = (\mu_1, \ldots, \mu_n)$ such that

$$Y_i = \sum_{j=1}^{n} \alpha_{ij} X_j + \mu_i, \qquad 1 \le i \le n,$$

with $X_1, \ldots, X_n$ independent $\mathcal{N}(0, 1)$-distributed.


Example 6.1.5. Suppose the three-dimensional random vector $Y = (Y_1, Y_2, Y_3)$ is
defined by

$$Y_1 = 2X_1 + X_2 - X_3 + 4, \qquad Y_2 = X_1 - 2X_2 + X_3 - 2 \qquad\text{and}\qquad Y_3 = X_1 - 2X_3 + 5$$

with $\mathcal{N}(0, 1)$-distributed independent $X_1, X_2, X_3$. Then Y is normally distributed.
Observe that it may be represented in the form of eq. (6.2) with A given by

$$A = \begin{pmatrix} 2 & 1 & -1 \\ 1 & -2 & 1 \\ 1 & 0 & -2 \end{pmatrix}$$

and with $\mu = (4, -2, 5)$. Moreover, we have det(A) = 9, hence A is regular.


Given a normal vector Y, how do we get the standard normal X in representation (6.2)?
The next proposition answers this question.


Proposition 6.1.6. A random vector $Y = (Y_1, \ldots, Y_n)$ is normal if and only if there exists
a regular n × n matrix $B = (\beta_{ij})_{i,j=1}^{n}$ and a vector $\nu = (\nu_1, \ldots, \nu_n) \in \mathbb{R}^n$ such that the random
variables $X_i$, defined by

$$X_i := \sum_{j=1}^{n} \beta_{ij} Y_j + \nu_i, \qquad 1 \le i \le n,$$

are independent standard normal.

Proof: This is a direct consequence of the following observation. One has $Y = AX + \mu$
if and only if X may be represented as $X = A^{-1}Y - A^{-1}\mu$. Therefore, the assertion follows
by choosing B and $\nu$ such that $B = A^{-1}$ and $\nu = -A^{-1}\mu$. □


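Example 6.1.7. For the random vector Y investigated in Example 6.1.5, the generated
independent standard normal random variables $X_1$, $X_2$, and $X_3$ may be represented as
follows:

$$X_1 = \frac{1}{9}\,(4Y_1 + 2Y_2 - Y_3 - 7), \qquad X_2 = \frac{1}{3}\,(Y_1 - Y_2 - Y_3 - 1),$$

$$X_3 = \frac{1}{9}\,(2Y_1 + Y_2 - 5Y_3 + 19).$$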
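The coefficients in Example 6.1.7 are the rows of $A^{-1}$ together with the shift $-A^{-1}\mu$, as in the proof of Proposition 6.1.6. The following minimal sketch recomputes them exactly for the matrix A and the vector μ of Example 6.1.5. The helper name `inverse` and the use of Gauss-Jordan elimination are choices made here for illustration only; they are not part of the text.

```python
from fractions import Fraction

def inverse(matrix):
    """Invert a regular square matrix exactly by Gauss-Jordan elimination."""
    n = len(matrix)
    # Build the augmented matrix [A | I] with rational entries.
    aug = [[Fraction(v) for v in row] + [Fraction(int(i == j)) for j in range(n)]
           for i, row in enumerate(matrix)]
    for col in range(n):
        # A nonzero pivot exists in this column because the matrix is regular.
        pivot = next(r for r in range(col, n) if aug[r][col] != 0)
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0:
                factor = aug[r][col]
                aug[r] = [a - factor * b for a, b in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]

A = [[2, 1, -1], [1, -2, 1], [1, 0, -2]]   # matrix A of Example 6.1.5
mu = [4, -2, 5]                             # vector mu of Example 6.1.5

B = inverse(A)                                             # B = A^{-1}
nu = [-sum(b * m for b, m in zip(row, mu)) for row in B]   # nu = -A^{-1} mu

for i, (row, shift) in enumerate(zip(B, nu), start=1):
    terms = " + ".join(f"({c})*Y{j}" for j, c in enumerate(row, start=1))
    print(f"X{i} = {terms} + ({shift})")
```

Working with `Fraction` avoids rounding, so the printed coefficients can be compared directly with the fractions in the example.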


Suppose $Y = AX + \mu$ is a normal vector. How can we evaluate its distribution density?
To answer this question, we introduce the following function. Let R > 0 be an n × n
matrix and $\mu \in \mathbb{R}^n$. The inverse matrix of R is $R^{-1}$, and to simplify the notation, set
|R| = det(R). Observe that R > 0 implies |R| > 0. With these notations we define a
function $p_{\mu,R}$ from $\mathbb{R}^n$ to $\mathbb{R}$ by

$$p_{\mu,R}(x) := \frac{1}{(2\pi)^{n/2} |R|^{1/2}}\; e^{-\langle R^{-1}(x-\mu),\,(x-\mu)\rangle/2}, \qquad x \in \mathbb{R}^n. \tag{6.3}$$

Now we are prepared to answer the above question about the density of Y.


Proposition 6.1.8. Suppose the normal vector Y is represented as in eq. (6.2) with regular
A and $\mu \in \mathbb{R}^n$. Define the positive matrix R by $R = AA^T$. Then $p_{\mu,R}$, as given in eq. (6.3),
is the distribution density of Y. In other words, if $B \in \mathcal{B}(\mathbb{R}^n)$, then

$$\mathbb{P}\{Y \in B\} = \frac{1}{(2\pi)^{n/2} |R|^{1/2}} \int_B e^{-\langle R^{-1}(x-\mu),\,(x-\mu)\rangle/2}\, dx.$$


Proof: Because $Y = AX + \mu$ with X standard normal, Proposition 6.1.2 implies

$$\mathbb{P}\{Y \in B\} = \mathbb{P}\{AX + \mu \in B\} = \mathbb{P}\{X \in A^{-1}(B - \mu)\}
= \frac{1}{(2\pi)^{n/2}} \int_{A^{-1}(B-\mu)} e^{-|y|^2/2}\, dy$$

for any Borel set $B \subseteq \mathbb{R}^n$. Hereby, $B - \mu$ denotes the set $\{b - \mu : b \in B\}$.

In the next step we change the variables by setting $x = Ay + \mu$. Then $dx = |\det(A)|\, dy$,
where by assumption $\det(A) \ne 0$ and, moreover, we have $y \in A^{-1}(B - \mu)$
if and only if $x \in B$. Therefore, the last integral transforms to

$$\mathbb{P}\{Y \in B\} = \frac{1}{(2\pi)^{n/2}}\, |\det(A)|^{-1} \int_B e^{-|A^{-1}(x-\mu)|^2/2}\, dx. \tag{6.4}$$

Proposition A.4.1 implies R > 0 and, moreover,

$$|R| = \det(R) = \det(AA^T) = \det(A) \cdot \det(A^T) = \det(A)^2.$$

Since |R| = det(R) > 0, this leads to $|R|^{1/2} = |\det(A)|$, that is, to

$$|\det(A)|^{-1} = |R|^{-1/2}. \tag{6.5}$$

Note that

$$|A^{-1}(x - \mu)|^2 = \langle A^{-1}(x - \mu),\, A^{-1}(x - \mu)\rangle = \langle (A^{-1})^T A^{-1}(x - \mu),\, (x - \mu)\rangle,$$

which by

$$(A^{-1})^T \circ A^{-1} = (A^T)^{-1} \circ A^{-1} = (A \circ A^T)^{-1} = R^{-1}$$

implies

$$|A^{-1}(x - \mu)|^2 = \langle R^{-1}(x - \mu),\, (x - \mu)\rangle. \tag{6.6}$$

Plugging eqs. (6.5) and (6.6) into eq. (6.4), we get

$$\mathbb{P}\{Y \in B\} = \int_B p_{\mu,R}(x)\, dx$$

with $p_{\mu,R}$ as in eq. (6.3). This completes the proof. □


Remark 6.1.9. What does Proposition 6.1.8 look like for n = 1? Here $Y = aX + \mu$, that is,
A = (a), and since A has to be regular, this implies $a \ne 0$. Hence we get $R = AA^T = (a^2)$,
$R^{-1} = (a^{-2})$ and $|R|^{1/2} = |a|$. Thus, the density of Y is given by

$$p_{\mu,R}(x) = \frac{1}{(2\pi)^{1/2} |R|^{1/2}}\; e^{-\langle R^{-1}(x-\mu),\,(x-\mu)\rangle/2}
= \frac{1}{(2\pi)^{1/2} |a|}\; e^{-(x-\mu)^2/2a^2}, \qquad x \in \mathbb{R}.$$

This coincides with the result obtained in Example 4.2.2.


In view of Proposition 6.1.8 we will use the following notation.

Definition 6.1.10. A normal vector Y is said to be $\mathcal{N}(\mu, R)$-distributed if $p_{\mu,R}$ is its
density, that is, if

$$\mathbb{P}\{Y \in B\} = \frac{1}{(2\pi)^{n/2} |R|^{1/2}} \int_B e^{-\langle R^{-1}(x-\mu),\,(x-\mu)\rangle/2}\, dx, \qquad B \in \mathcal{B}(\mathbb{R}^n).$$


Remark 6.1.11. It follows from Proposition A.4.2 that, given any $\mu \in \mathbb{R}^n$ and any R > 0,
there exists a normal vector Y that is $\mathcal{N}(\mu, R)$-distributed. Indeed, write R > 0 as
$R = AA^T$ and set $Y = AX + \mu$ with X standard normal. Then Y is $\mathcal{N}(\mu, R)$-distributed
by Proposition 6.1.8.


$$\{\text{Distributions of } \mathbb{R}^n\text{-valued normal vectors}\} \iff \{\mu \in \mathbb{R}^n,\ R > 0\}$$


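Example 6.1.12. Assume

$$Y_1 = X_1 - X_2 + 3 \qquad\text{and}\qquad Y_2 = 2X_1 + X_2 - 2$$

for $X_1, X_2$ independent $\mathcal{N}(0, 1)$-distributed. Then we get

$$\mu = (3, -2) \qquad\text{and}\qquad A = \begin{pmatrix} 1 & -1 \\ 2 & 1 \end{pmatrix},$$

which implies

$$R = AA^T = \begin{pmatrix} 1 & -1 \\ 2 & 1 \end{pmatrix} \begin{pmatrix} 1 & 2 \\ -1 & 1 \end{pmatrix} = \begin{pmatrix} 2 & 1 \\ 1 & 5 \end{pmatrix}. \tag{6.7}$$

Thus, Y is $\mathcal{N}(\mu, R)$-distributed with $\mu = (3, -2)$ and R as in eq. (6.7).
Which density does Y possess? To answer this, we have to compute det(R) and $R^{-1}$.
One easily gets det(R) = 9. The inverse matrix of R equals

$$R^{-1} = \frac{1}{9} \begin{pmatrix} 5 & -1 \\ -1 & 2 \end{pmatrix}.$$

Therefore, the distribution density $p_{\mu,R}$ of $Y = (Y_1, Y_2)$ is given by

$$p_{\mu,R}(x_1, x_2) = \frac{1}{6\pi} \exp\Big(-\frac{1}{2} \big\langle R^{-1}(x_1 - 3,\, x_2 + 2),\, (x_1 - 3,\, x_2 + 2)\big\rangle\Big)$$

$$= \frac{1}{6\pi} \exp\Big(-\frac{1}{18} \big[\,5(x_1 - 3)^2 - 2(x_1 - 3)(x_2 + 2) + 2(x_2 + 2)^2\,\big]\Big). \tag{6.8}$$


Figure 6.1: The density given by eq. (6.8).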
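The quantities in Example 6.1.12 can be cross-checked by simulation. The sketch below, written for this purpose and not taken from the text, forms $R = AA^T$ for the matrix A with rows (1, -1) and (2, 1), draws many realizations of $Y = AX + \mu$ with μ = (3, -2), and compares the empirical mean and covariance with μ and R. The sample size and the seed are arbitrary assumptions.

```python
import random

def matmul_transpose(a):
    """Return A A^T for a square matrix given as a list of rows."""
    n = len(a)
    return [[sum(a[i][k] * a[j][k] for k in range(n)) for j in range(n)]
            for i in range(n)]

A = [[1, -1], [2, 1]]      # Y1 = X1 - X2 + 3, Y2 = 2*X1 + X2 - 2
mu = [3, -2]
R = matmul_transpose(A)    # covariance matrix R = A A^T
det_R = R[0][0] * R[1][1] - R[0][1] * R[1][0]

def sample_y(rng):
    """Draw one realization of Y = A X + mu with X standard normal."""
    x = [rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)]
    return [sum(A[i][k] * x[k] for k in range(2)) + mu[i] for i in range(2)]

rng = random.Random(2016)  # fixed seed (arbitrary) so that runs are reproducible
samples = [sample_y(rng) for _ in range(100_000)]
n = len(samples)
mean = [sum(s[i] for s in samples) / n for i in range(2)]
cov = [[sum((s[i] - mean[i]) * (s[j] - mean[j]) for s in samples) / n
        for j in range(2)] for i in range(2)]

print("R =", R, " det(R) =", det_R)
print("empirical mean:", mean)
print("empirical covariance:", cov)
```

The empirical values should lie close to μ and R, up to the usual sampling fluctuation.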


For later purposes we have to name the probability measures on $(\mathbb{R}^n, \mathcal{B}(\mathbb{R}^n))$ appearing
as distributions of normal vectors.


Definition 6.1.13. Given $\mu \in \mathbb{R}^n$ and R > 0, the probability measure $\mathcal{N}(\mu, R)$ on
$(\mathbb{R}^n, \mathcal{B}(\mathbb{R}^n))$ is defined by

$$\mathcal{N}(\mu, R)(B) := \int_B p_{\mu,R}(x)\, dx = \frac{1}{(2\pi)^{n/2} |R|^{1/2}} \int_B e^{-\frac{1}{2}\langle R^{-1}(x-\mu),\,(x-\mu)\rangle}\, dx.$$

$\mathcal{N}(\mu, R)$ is called a multivariate normal distribution with¹ expected value $\mu$ and
covariance matrix R.


According to Definition 6.1.13, we may now formulate Proposition 6.1.8 as follows:


Proposition 6.1.14. Let Y be a random vector. Then the following are equivalent.
1. Y is $\mathcal{N}(\mu, R)$-distributed.

2. $\mathbb{P}_Y = \mathcal{N}(\mu, R)$.

3. There is a regular n × n matrix A with $R = AA^T$ such that $Y = AX + \mu$ with a standard normal X.


Remark 6.1.15. The case $R = I_n$ (as in Section A.4 we denote the identity matrix in $\mathbb{R}^n$
by $I_n$) and $\mu = 0$ is of special interest. Because $I_n^{-1} = I_n$ and $\det(I_n) = 1$, we get

$$p_{0,I_n}(x) = \frac{1}{(2\pi)^{n/2}}\; e^{-|x|^2/2}, \qquad x \in \mathbb{R}^n.$$

This tells us that $\mathcal{N}(0, I_n)$ is nothing else than the multivariate standard normal distribution
introduced in Definition 1.9.16. Written as a formula, this means

$$\mathcal{N}(0, 1)^{\otimes n} = \mathcal{N}(0, I_n).$$

More generally, in view of eq. (1.75) it follows that

$$\mathcal{N}(\mu, \sigma^2)^{\otimes n} = \mathcal{N}(\vec{\mu}, \sigma^2 I_n)$$

where $\vec{\mu} = (\mu, \ldots, \mu) \in \mathbb{R}^n$ and $\sigma > 0$. In other words,

$$\mathcal{N}(\mu, \sigma^2)^{\otimes n}(B) = \mathcal{N}(\vec{\mu}, \sigma^2 I_n)(B) = \frac{1}{(2\pi)^{n/2} \sigma^n} \int_B e^{-|x - \vec{\mu}|^2/2\sigma^2}\, dx. \tag{6.9}$$

¹ Why they are named in this way will become clear in the next section.

For later purposes, the next result is of importance. 


Proposition 6.1.16. Suppose a normal vector $Y = (Y_1, \ldots, Y_n)$ may be written as

$$Y = UX$$

with an $\mathcal{N}(0, I_n)$-distributed (standard normal) X and a unitary matrix U. Then its
coordinate mappings $Y_1, \ldots, Y_n$ are independent standard normal random variables.


Proof: The random vector Y is $\mathcal{N}(0, UU^T)$-distributed. But U is unitary, hence $UU^T = I_n$
and Y is $\mathcal{N}(0, I_n)$ or, equivalently, standard normally distributed. Then the assertion
follows by Proposition 6.1.2. □


Example 6.1.17. For $\theta \in [0, 2\pi)$ define the 2 × 2 matrix U by

$$U = \begin{pmatrix} \cos\theta & \sin\theta \\ -\sin\theta & \cos\theta \end{pmatrix}.$$

The matrix U is unitary and by Proposition 6.1.16 the vector Y = UX is standard normal.
In other words, given independent standard normal $X_1$ and $X_2$, for each $\theta \in [0, 2\pi)$ the
random variables

$$Y_1 := \cos\theta\, X_1 + \sin\theta\, X_2 \qquad\text{and}\qquad Y_2 := -\sin\theta\, X_1 + \cos\theta\, X_2$$

are independent and standard normally distributed as well.


6.2 Expected Value and Covariance Matrix


We start with the following definition. 


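Definition 6.2.1. Let $Y = (Y_1, \ldots, Y_n)$ be a random vector such that $\mathbb{E}|Y_j| < \infty$ for
all $1 \le j \le n$. Then the vector

$$\mathbb{E}Y := (\mathbb{E}Y_1, \ldots, \mathbb{E}Y_n) = (\mu_1, \ldots, \mu_n)$$

is called the (multivariate) expected value of Y.
If $\mathbb{E}\,Y_j^2 < \infty$, $1 \le j \le n$, then the matrix

$$\mathrm{Cov}_Y := \big(\mathrm{Cov}(Y_i, Y_j)\big)_{i,j=1}^{n} = \big(\mathbb{E}(Y_i - \mu_i)(Y_j - \mu_j)\big)_{i,j=1}^{n}$$

is said to be the covariance matrix of Y.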


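Remark 6.2.2. It is important to notice that both $\mathbb{E}Y$ and the covariance matrix $\mathrm{Cov}_Y$
depend only on the distribution of Y. That is, whenever two random vectors $\vec{Y}_1$ and $\vec{Y}_2$
satisfy $\mathbb{P}_{\vec{Y}_1} = \mathbb{P}_{\vec{Y}_2}$, then

$$\mathbb{E}\vec{Y}_1 = \mathbb{E}\vec{Y}_2 \qquad\text{and}\qquad \mathrm{Cov}_{\vec{Y}_1} = \mathrm{Cov}_{\vec{Y}_2}.$$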


The next proposition describes the (multivariate) expected value and the covariance 
matrix of a normally distributed vector. 


Proposition 6.2.3. Assume $Y = AX + \mu$ for some regular matrix A and $\mu \in \mathbb{R}^n$. Define
$R = (r_{ij})_{i,j=1}^{n}$ as $R = AA^T$. Then the following is valid.

(1) We have $\mathbb{E}Y = \mu$ and $\mathrm{Cov}_Y = \big(\mathrm{Cov}(Y_i, Y_j)\big)_{i,j=1}^{n} = R$.

(2) Given $a \in \mathbb{R}^n$, $a \ne 0$, then $\langle Y, a\rangle$ is a normal random variable with expected value
$\langle \mu, a\rangle$ and variance $\langle Ra, a\rangle$.

(3) The coordinate mappings $Y_i$ are $\mathcal{N}(\mu_i, r_{ii})$-distributed, $1 \le i \le n$, that is, the
marginal distributions of Y are the probability measures $\mathcal{N}(\mu_i, r_{ii})$.


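Proof: By assumption

$$Y_i = \sum_{j=1}^{n} \alpha_{ij} X_j + \mu_i, \qquad i = 1, \ldots, n, \tag{6.10}$$

hence, the linearity of the expected value and $\mathbb{E}X_j = 0$ imply

$$\mathbb{E}Y_i = \sum_{j=1}^{n} \alpha_{ij}\, \mathbb{E}X_j + \mu_i = \mu_i, \qquad 1 \le i \le n.$$

This proves $\mathbb{E}Y = (\mathbb{E}Y_1, \ldots, \mathbb{E}Y_n) = \mu$.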




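Let us now verify the second part of property (1). Using $\mu_i = \mathbb{E}Y_i$, by representation
(6.10) we get

$$\mathrm{Cov}(Y_i, Y_j) = \mathbb{E}\big[(Y_i - \mu_i)(Y_j - \mu_j)\big] = \mathbb{E}\Big(\sum_{k=1}^{n} \alpha_{ik} X_k\Big)\Big(\sum_{l=1}^{n} \alpha_{jl} X_l\Big)
= \sum_{k,l=1}^{n} \alpha_{ik}\alpha_{jl}\, \mathbb{E}X_k X_l.$$

The $X_j$s are independent $\mathcal{N}(0, 1)$-distributed, hence

$$\mathbb{E}X_k X_l = \begin{cases} 1 & : k = l \\ 0 & : k \ne l, \end{cases}$$

leading to

$$\mathrm{Cov}(Y_i, Y_j) = \sum_{k=1}^{n} \alpha_{ik}\alpha_{jk} = r_{ij}.$$

To see this, recall that $R = AA^T$, hence $r_{ij} = \sum_{k=1}^{n} \alpha_{ik}\alpha_{jk}$. This proves $\mathrm{Cov}_Y = R$ as
asserted.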

To verify property (2) we first treat a special case, namely that the vector is
standard normally distributed. So suppose that X is $\mathcal{N}(0, I_n)$-distributed. In this case,
property (2) asserts the following. For any $b \in \mathbb{R}^n$, $b \ne 0$,

$$\langle X, b\rangle \ \text{is distributed according to}\ \mathcal{N}(0, |b|^2). \tag{6.11}$$

If $b = (b_1, \ldots, b_n)$, then

$$\langle X, b\rangle = \sum_{j=1}^{n} b_j X_j = \sum_{j=1}^{n} Z_j$$

with $Z_j = b_j X_j$. The random variables $Z_1, \ldots, Z_n$ are independent and, moreover, by
Proposition 4.2.3, the $Z_j$s are $\mathcal{N}(0, b_j^2)$-distributed. Proposition 4.6.18 implies that

$$\sum_{j=1}^{n} Z_j \ \text{is distributed according to}\ \mathcal{N}\Big(0, \sum_{j=1}^{n} b_j^2\Big).$$

In view of $\sum_{j=1}^{n} b_j^2 = |b|^2$ this proves assertion (6.11).
Let us now turn to the general case. Recall that

$$Y = AX + \mu$$

and $R = AA^T$. If $a \in \mathbb{R}^n$ is a nonzero vector, then we take the scalar product with respect
to a on both sides of the last equation and obtain

$$\langle Y, a\rangle = \langle AX, a\rangle + \langle \mu, a\rangle = \langle X, A^T a\rangle + \langle \mu, a\rangle.$$

An application of statement (6.11) with $b = A^T a$ lets us conclude that $\langle X, A^T a\rangle$ is
$\mathcal{N}(0, |A^T a|^2)$-distributed, that is, $\langle Y, a\rangle$ is $\mathcal{N}(\langle \mu, a\rangle, |A^T a|^2)$-distributed. Here we used
that A, hence also $A^T$, is regular, so that $a \ne 0$ yields $b = A^T a \ne 0$, and statement (6.11)
applies. Assertion (2) follows now by

$$|A^T a|^2 = \langle A^T a, A^T a\rangle = \langle AA^T a, a\rangle = \langle Ra, a\rangle.$$


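Property (3) is an immediate consequence of the second one. An application of
property (2) to the ith unit vector $e_i = (0, \ldots, 0, 1, 0, \ldots, 0)$ in $\mathbb{R}^n$ (with the entry 1 at
position i) leads on one side to

$$\langle Y, e_i\rangle = Y_i, \qquad 1 \le i \le n,$$

and on the other side to

$$\langle Re_i, e_i\rangle = r_{ii} \qquad\text{and}\qquad \langle \mu, e_i\rangle = \mu_i, \qquad 1 \le i \le n.$$

Thus, by property (2), for each $i \le n$ the random variable $Y_i$ is $\mathcal{N}(\mu_i, r_{ii})$-distributed.
This completes the proof. □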
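As a quick numerical check of property (2), take the data of Example 6.1.12, that is, the matrix A with rows (1, -1) and (2, 1), the vector μ = (3, -2) and R with rows (2, 1) and (1, 5), and choose a = (1, 1). The choice of a is an arbitrary assumption made here for illustration.

$$A^T a = \begin{pmatrix} 1 & 2 \\ -1 & 1 \end{pmatrix} \begin{pmatrix} 1 \\ 1 \end{pmatrix} = \begin{pmatrix} 3 \\ 0 \end{pmatrix}, \qquad |A^T a|^2 = 9,$$

$$\langle Ra, a\rangle = \Big\langle \begin{pmatrix} 2 & 1 \\ 1 & 5 \end{pmatrix} \begin{pmatrix} 1 \\ 1 \end{pmatrix}, \begin{pmatrix} 1 \\ 1 \end{pmatrix} \Big\rangle = 3 + 6 = 9, \qquad \langle \mu, a\rangle = 3 - 2 = 1.$$

Both computations give the variance 9, and indeed $\langle Y, a\rangle = Y_1 + Y_2 = (X_1 - X_2 + 3) + (2X_1 + X_2 - 2) = 3X_1 + 1$ is $\mathcal{N}(1, 9)$-distributed.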


Corollary 6.2.4. If Y is $\mathcal{N}(\mu, R)$-distributed, then $\mathbb{E}Y = \mu$ and $\mathrm{Cov}_Y = R$.


Proof: Choose any regular n × n matrix A such that $R = AA^T$. The existence of such
an A is proved in Proposition A.4.2. Set $Z = AX + \mu$ for some standard normal vector
X. Then Y as well as Z are both $\mathcal{N}(\mu, R)$-distributed, hence $Z \stackrel{d}{=} Y$. Proposition 6.2.3
implies $\mathbb{E}Z = \mu$ and $\mathrm{Cov}_Z = R$. Consequently, by Remark 6.2.2 it follows that

$$\mathbb{E}Y = \mathbb{E}Z = \mu \qquad\text{and}\qquad \mathrm{Cov}_Y = \mathrm{Cov}_Z = R,$$

which completes the proof. □

In view of Corollary 6.2.4 we will use the following notation.

Definition 6.2.5. If Y is $\mathcal{N}(\mu, R)$-distributed, then the parameters $\mu$ and R are
called the (multivariate) expected value and the covariance matrix of Y,
respectively.


Remark 6.2.6. We proved above that for any normal vector Y the coordinate mappings
$Y_i = \langle Y, e_i\rangle$ are normal as well. The converse is not valid. There are random vectors Y
such that all random variables $\langle Y, e_i\rangle$, $1 \le i \le n$, are normal, although Y itself is not.
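
A standard construction of such a vector (a well-known example, not taken from the text above) is the following. Let X be standard normal and let ε be independent of X with $\mathbb{P}\{\varepsilon = 1\} = \mathbb{P}\{\varepsilon = -1\} = 1/2$. Set $Y := (X, \varepsilon X)$. By the symmetry of the standard normal distribution both coordinates of Y are $\mathcal{N}(0, 1)$-distributed. However,

$$\langle Y, (1, 1)\rangle = X + \varepsilon X = (1 + \varepsilon)\, X$$

equals 0 with probability 1/2 without being constant, so it is not a normal random variable. By property (2) of Proposition 6.2.3 the vector Y cannot be normal.
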
In contrast to this remark, the following is valid.

Proposition 6.2.7. If $\langle Y, a\rangle$ is normal for all nonzero $a \in \mathbb{R}^n$, then Y is normal as well.


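Idea of the proof. By assumption, for each $a \ne 0$ there are real numbers $\mu_a$ and $\sigma_a > 0$
such that $\langle Y, a\rangle$ is $\mathcal{N}(\mu_a, \sigma_a^2)$-distributed. In order to prove the proposition, one has to
show that there are a $\mu \in \mathbb{R}^n$ with $\mu_a = \langle \mu, a\rangle$ and an R > 0 such that $\sigma_a^2 = \langle Ra, a\rangle$,
$a \in \mathbb{R}^n$. The existence of the vector $\mu$ easily follows from

$$\mu_{\alpha a + \beta b} = \mathbb{E}\,\langle Y, \alpha a + \beta b\rangle = \alpha\, \mathbb{E}\,\langle Y, a\rangle + \beta\, \mathbb{E}\,\langle Y, b\rangle = \alpha \mu_a + \beta \mu_b,$$

using the fact that each linear mapping from $\mathbb{R}^n$ to $\mathbb{R}$ is of the form $a \mapsto \langle a, \mu\rangle$ for a
suitable $\mu \in \mathbb{R}^n$.

The existence of an R > 0 with $\sigma_a^2 = \langle Ra, a\rangle$ is a consequence of a representation
theorem for positive quadratic forms on $\mathbb{R}^n$. To this end, one has to show that $a \mapsto \sigma_a^2$
is a positive quadratic form, which follows by using $\sigma_a^2 = \mathbb{V}\,\langle Y, a\rangle = \mathbb{E}\,|\langle Y - \mu, a\rangle|^2$.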


As we saw above (cf. Proposition 5.3.10), independent random variables are uncorrelated.
On the other hand, Examples 5.3.12 and 5.3.15 showed the existence of
uncorrelated variables that are not independent. Thus, in general, the property of
being uncorrelated is weaker than that of being independent.

One of the basic features of normal vectors is that for them uncorrelated coordinate
mappings are already independent. This may explain why, in everyday language,
the two properties are often treated as synonyms.


Proposition 6.2.8. Let $Y = (Y_1, \ldots, Y_n)$ be a normally distributed vector. Then the
following are equivalent.

(1) $Y_1, \ldots, Y_n$ are independent.

(2) $Y_1, \ldots, Y_n$ are uncorrelated.

(3) The covariance matrix $\mathrm{Cov}_Y$ is a diagonal matrix.


Proof: The implication (1) ⇒ (2) follows by Proposition 5.3.10. If the $Y_j$s are uncorrelated,
then this tells us that $\mathrm{Cov}(Y_i, Y_j) = 0$ whenever $i \ne j$. Thus, $\mathrm{Cov}_Y$ is a diagonal
matrix, which proves (2) ⇒ (3).

It remains to verify (3) ⇒ (1). Thus assume that Y is $\mathcal{N}(\mu, R)$-distributed, where
R > 0 is a diagonal matrix. Let $r_{11}, \ldots, r_{nn}$ be the entries of R on the diagonal. Define A
as the diagonal matrix with $r_{11}^{1/2}, \ldots, r_{nn}^{1/2}$ on the diagonal. Note that R > 0 implies $r_{ii} > 0$,
hence A is well-defined. Of course, then $AA^T = R$, hence Y has the same distribution
as the vector $Z = (Z_1, \ldots, Z_n)$ with

$$Z_i := r_{ii}^{1/2}\, X_i + \mu_i, \qquad 1 \le i \le n,$$

where $X_1, \ldots, X_n$ are independent standard normal. Proposition 4.1.9 lets us conclude
that $Z_1, \ldots, Z_n$ are independent normal random variables. But since $Y \stackrel{d}{=} Z$, the random
variables $Y_1, \ldots, Y_n$ are independent as well². □


Remark 6.2.9. Another property, being equivalent to those in Proposition 6.2.8, is as
follows. The density function of Y is

$$p_{\mu,R}(x) = \frac{1}{(2\pi)^{n/2} |R|^{1/2}}\; e^{-\sum_{j=1}^{n} (x_j - \mu_j)^2/2r_{jj}}, \qquad x = (x_1, \ldots, x_n).$$

Note that $|R| = \det(R) = r_{11} \cdots r_{nn}$.


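Finally, we investigate the case of two-dimensional normal vectors more thoroughly.
Thus assume $Y = (Y_1, Y_2)$ is a normal vector. Then the covariance matrix R is given by

$$R = \begin{pmatrix} \mathbb{V}Y_1 & \mathrm{Cov}(Y_1, Y_2) \\ \mathrm{Cov}(Y_1, Y_2) & \mathbb{V}Y_2 \end{pmatrix}.$$

Let $\sigma_1^2$ and $\sigma_2^2$ be the variance of $Y_1$ and $Y_2$, respectively, and let $\rho := \rho(Y_1, Y_2)$ be their
correlation coefficient. Because of

$$\mathrm{Cov}(Y_1, Y_2) = (\mathbb{V}Y_1)^{1/2} (\mathbb{V}Y_2)^{1/2} \rho(Y_1, Y_2) = \sigma_1 \sigma_2 \rho$$

we may rewrite R as

$$R = \begin{pmatrix} \sigma_1^2 & \rho\sigma_1\sigma_2 \\ \rho\sigma_1\sigma_2 & \sigma_2^2 \end{pmatrix}.$$

This implies $\det(R) = \sigma_1^2 \sigma_2^2 (1 - \rho^2)$. Since $\sigma_1^2 > 0$ and $\sigma_2^2 > 0$, the matrix R is positive if and only if
$|\rho| < 1$. The inverse matrix $R^{-1}$ can be computed by Cramer’s rule as

$$R^{-1} = \frac{1}{\sigma_1^2 \sigma_2^2 (1 - \rho^2)} \begin{pmatrix} \sigma_2^2 & -\rho\sigma_1\sigma_2 \\ -\rho\sigma_1\sigma_2 & \sigma_1^2 \end{pmatrix}
= \frac{1}{1 - \rho^2} \begin{pmatrix} \frac{1}{\sigma_1^2} & -\frac{\rho}{\sigma_1\sigma_2} \\ -\frac{\rho}{\sigma_1\sigma_2} & \frac{1}{\sigma_2^2} \end{pmatrix}.$$

Consequently,

$$\langle R^{-1}x, x\rangle = \frac{1}{1 - \rho^2} \Big(\frac{x_1^2}{\sigma_1^2} - \frac{2\rho x_1 x_2}{\sigma_1\sigma_2} + \frac{x_2^2}{\sigma_2^2}\Big), \qquad x = (x_1, x_2) \in \mathbb{R}^2.$$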


² Indeed, use the characterization of independent random variables given in Proposition 3.6.5. The
condition stated there depends only on the joint distribution.

If $\mu = (\mu_1, \mu_2) = (\mathbb{E}Y_1, \mathbb{E}Y_2)$ denotes the expected value of Y, then for $a_1 < b_1$ and
$a_2 < b_2$,

$$\mathbb{P}\{a_1 \le Y_1 \le b_1,\ a_2 \le Y_2 \le b_2\} = \frac{1}{2\pi \sigma_1 \sigma_2 (1 - \rho^2)^{1/2}}
\int_{a_1}^{b_1} \int_{a_2}^{b_2} \exp\Big(-\frac{1}{2(1 - \rho^2)} \Big[\frac{(x_1 - \mu_1)^2}{\sigma_1^2} - \frac{2\rho (x_1 - \mu_1)(x_2 - \mu_2)}{\sigma_1 \sigma_2} + \frac{(x_2 - \mu_2)^2}{\sigma_2^2}\Big]\Big)\, dx_2\, dx_1. \tag{6.12}$$

Compare this with the case of independent $Y_1$ and $Y_2$. Here it follows that

$$\mathbb{P}\{a_1 \le Y_1 \le b_1,\ a_2 \le Y_2 \le b_2\} = \frac{1}{2\pi \sigma_1 \sigma_2}
\int_{a_1}^{b_1} \int_{a_2}^{b_2} \exp\Big(-\frac{1}{2} \Big[\frac{(x_1 - \mu_1)^2}{\sigma_1^2} + \frac{(x_2 - \mu_2)^2}{\sigma_2^2}\Big]\Big)\, dx_2\, dx_1. \tag{6.13}$$

It is worthwhile to mention that in both cases (dependent and independent) the marginal
distributions are the same, namely $\mathcal{N}(\mu_1, \sigma_1^2)$ and $\mathcal{N}(\mu_2, \sigma_2^2)$. A comparison of
eqs. (6.12) and (6.13) shows clearly the influence of the correlation coefficient on the
density (Fig. 6.2).
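
The integrand of eq. (6.12) is easy to evaluate numerically. The following sketch is an illustration written for this purpose; the function names and the evaluation point are assumptions. It implements the two-dimensional density in terms of $\mu_1, \mu_2, \sigma_1, \sigma_2$ and $\rho$ and evaluates it for the parameter values listed in the caption of Fig. 6.2.

```python
import math

def bivariate_normal_pdf(x1, x2, mu1, mu2, sigma1, sigma2, rho):
    """Density of N(mu, R) on R^2 with R built from sigma1, sigma2 > 0 and |rho| < 1."""
    if not (sigma1 > 0 and sigma2 > 0 and abs(rho) < 1):
        raise ValueError("need sigma1 > 0, sigma2 > 0 and |rho| < 1")
    z1 = (x1 - mu1) / sigma1
    z2 = (x2 - mu2) / sigma2
    # Quadratic form <R^{-1}(x - mu), (x - mu)> in standardized coordinates.
    quad = (z1 * z1 - 2.0 * rho * z1 * z2 + z2 * z2) / (1.0 - rho * rho)
    # Normalizing constant 2 * pi * |R|^{1/2}.
    norm = 2.0 * math.pi * sigma1 * sigma2 * math.sqrt(1.0 - rho * rho)
    return math.exp(-0.5 * quad) / norm

def normal_pdf(x, mu, sigma):
    """Density of the univariate normal distribution N(mu, sigma^2)."""
    return math.exp(-0.5 * ((x - mu) / sigma) ** 2) / (sigma * math.sqrt(2.0 * math.pi))

# Parameters from the caption of Fig. 6.2: mu1 = mu2 = 0, sigma1 = 2, sigma2 = 1.
for rho in (0.0, 0.25, 0.75, -0.5):
    print(rho, bivariate_normal_pdf(1.0, 1.0, 0.0, 0.0, 2.0, 1.0, rho))
```

For ρ = 0 the value agrees with the product `normal_pdf(x1, mu1, sigma1) * normal_pdf(x2, mu2, sigma2)`, which is the integrand of eq. (6.13).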


Figure 6.2: $\mu_1 = \mu_2 = 0$, $\sigma_1 = 2$, $\sigma_2 = 1$ and $\rho = 0$, $\rho = 1/4$, $\rho = 3/4$, and $\rho = -1/2$ from top left to bottom right.


6.3 Problems


Problem 6.1. Let $Y = (Y_1, \ldots, Y_n)$ be an arbitrary (not necessarily normal) random
vector.

1. Show that $\mathbb{E}|Y_j| < \infty$, $1 \le j \le n$, if and only if $\mathbb{E}|Y| < \infty$. Here |Y| denotes the
Euclidean norm of Y.

2. Let A be an arbitrary n × n matrix. Prove that

$$\mathbb{E}(AY) = A(\mathbb{E}Y)$$

provided that $\mathbb{E}|Y| < \infty$.

3. Show that $\mathbb{E}|Y_j|^2 < \infty$, $1 \le j \le n$, if and only if $\mathbb{E}|Y|^2 < \infty$.

4. Suppose $\mathbb{E}|Y|^2 < \infty$. Let $\mathrm{Cov}_Y$ be the covariance matrix of Y. Prove that $\mathrm{Cov}_Y$ is
non-negative definite, that is,

$$\langle \mathrm{Cov}_Y\, x, x\rangle \ge 0, \qquad x \in \mathbb{R}^n.$$


Problem 6.2. Let $X = (X_1, X_2)$ be a two-dimensional standard normal vector. Compute

$$\mathbb{P}\{|X| \le 1\} = \mathbb{P}\{X_1^2 + X_2^2 \le 1\}.$$

Hint: Compare the proof of Proposition 1.6.6.


Problem 6.3. Let $X_1, \ldots, X_{n+m}$ be a sequence of independent standard normal random
variables. For an n × n matrix $A = (\alpha_{ij})_{i,j=1}^{n}$ and an m × m matrix $B = (\beta_{kl})_{k,l=1}^{m}$ define two
normal vectors Y and Z by

$$Y_i := \sum_{j=1}^{n} \alpha_{ij} X_j \qquad\text{and}\qquad Z_k := \sum_{l=1}^{m} \beta_{kl} X_{l+n},$$

$1 \le i \le n$ and $1 \le k \le m$. Let (Y, Z) be the (n + m)-dimensional vector

$$(Y, Z) := (Y_1, \ldots, Y_n, Z_1, \ldots, Z_m).$$

Why is (Y, Z) normal? Show that the covariance matrix $\mathrm{Cov}_{(Y,Z)}$ is given by

$$\mathrm{Cov}_{(Y,Z)} = \begin{pmatrix} \mathrm{Cov}_Y & 0 \\ 0 & \mathrm{Cov}_Z \end{pmatrix}.$$


Problem 6.4. Let $X_1$, $X_2$, and $X_3$ be three standard normal independent random variables.
Define the random vector Y by

$$Y := (X_1 - 1,\ X_1 + X_2 - 1,\ X_1 + X_2 + X_3 - 1).$$

1. Argue why Y is normal. Determine its expected value, its covariance matrix, and
the correlation coefficients $\rho(Y_i, Y_j)$, $1 \le i < j \le 3$.
2. Determine the distribution density of Y.


Problem 6.5. The random vector $Y = (Y_1, \ldots, Y_n)$ is $\mathcal{N}(\mu, R)$-distributed for some $\mu \in \mathbb{R}^n$
and R > 0. Determine the distribution of $Y_1 + \cdots + Y_n$.


Problem 6.6. Prove the following assertion: If Y is $\mathcal{N}(0, R)$-distributed, then there exist
an orthonormal basis $(f_j)_{j=1}^{n}$ in $\mathbb{R}^n$, positive numbers $\lambda_1, \ldots, \lambda_n$ and independent
$\mathcal{N}(0, 1)$-distributed $\xi_1, \ldots, \xi_n$ such that

$$Y = \sum_{j=1}^{n} \lambda_j\, \xi_j\, f_j. \tag{6.14}$$

Hint: Use the principal axis transformation for symmetric matrices and the fact that
unitary matrices map an orthonormal basis onto an orthonormal basis.

Conclude from eq. (6.14) the following: If Y is $\mathcal{N}(0, R)$-distributed, then there are
$a_1, \ldots, a_n$ in $\mathbb{R}^n$ such that $\langle Y, a_1\rangle, \ldots, \langle Y, a_n\rangle$ is a sequence of independent standard
normal random variables.


Problem 6.7. The n-dimensional vector Y is distributed according to $\mathcal{N}(\mu, R)$. For some
regular n × n matrix S define Z by Z := SY. Is Z normal? If this is so, determine the
expected value and the covariance matrix of Z.


Problem 6.8. Let $X = (X_1, X_2)$ be standard normal. Define random variables $Y_1$ and
$Y_2$ by

$$Y_1 := \frac{1}{\sqrt{2}}\,(X_1 + X_2) \qquad\text{and}\qquad Y_2 := \frac{1}{\sqrt{2}}\,(X_1 - X_2).$$

Why are $Y_1$ and $Y_2$ also independent and standard normal?


7 Limit Theorems 


Probability Theory cannot predict the occurrence or nonoccurrence of a single event
in a random experiment, unless this event occurs with probability one or with
probability zero. For example, Probability Theory does not
give any information about the next result when rolling a die, it does not predict the
numbers drawn in next week's lottery, nor is it able to foresee the lifetime of a
component in a machine. Such statements are impossible within the theory. The theory
is only able to say that some events are more likely and others are less likely. For
instance, when rolling a die twice, it is more likely that the sum of both rolls will be
“7” than “2.” Nevertheless, the next time we roll the die twice the sum may be “2,” not “7.” The
event “the sum is 2” is not impossible, only less likely.

In contrast, Probability Theory provides us with very precise and far-reaching
information about the behavior of the results when we execute “many” identical random
experiments. As already said, we cannot tell anything about the number showing on a
die when we roll it once, but we are able to say a lot about the frequency of the number
“6” when rolling a die many times, namely that, on average, this number will appear
in one of six cases (provided the die is fair). In this example, certain laws of Probability
Theory, which we will present in this section, are operating. These laws are only
applicable in the case of many experiments, not in that of a single one.

Limit theorems in Probability Theory belong to the most beautiful and most important
assertions within this theory. They are always the highlight of a lecture about
advanced Probability Theory. However, their proofs require longer and more comprehensive
mathematical arguments than can be given within the framework of this
book. Readers who are interested in knowing more about this topic may consult
one of the more advanced books, such as [Bil12], [Dur10], or [Kho07]. Although
the proofs of the limit theorems are mostly quite complicated, the theorems themselves are very important,
and their consequences influence our daily lives. Moreover, great parts of Mathematical
Statistics are based on these results. Therefore, we decided to state here the crucial
assertions without proving most of them. Thus, our main focus is to present the most
important limit theorems, to explain them in detail, and to give examples that show
how they apply. If possible, we give some hint as to how the results are derived, but
mostly we have to refrain from proving them.


7.1 Laws of Large Numbers 
7.1.1 Chebyshev’s Inequality 


Our first objective is to prove Chebyshev’s inequality. To do so, we need the following
lemma.

Lemma 7.1.1. Let Y be a non-negative random variable. Then for each $\lambda > 0$ it
follows that

$$\mathbb{P}\{Y \ge \lambda\} \le \frac{\mathbb{E}Y}{\lambda}. \tag{7.1}$$


Proof: Let us first treat the case that Y is discrete. Since $Y \ge 0$, its possible values
$y_1, y_2, \ldots$ are nonnegative real numbers. Therefore, we get

$$\mathbb{E}Y = \sum_{j=1}^{\infty} y_j\, \mathbb{P}\{Y = y_j\} \ge \sum_{y_j \ge \lambda} y_j\, \mathbb{P}\{Y = y_j\}
\ge \lambda \sum_{y_j \ge \lambda} \mathbb{P}\{Y = y_j\} = \lambda\, \mathbb{P}\{Y \ge \lambda\}.$$

Solving the inequality for $\mathbb{P}\{Y \ge \lambda\}$ proves inequality (7.1).

The proof of estimate (7.1) for continuous Y uses similar methods. If q denotes the
distribution density of Y, by $Y \ge 0$ we may suppose q(y) = 0 for y < 0. Then, as in the
discrete case, we conclude that

$$\mathbb{E}Y = \int_0^{\infty} y\, q(y)\, dy \ge \int_{\lambda}^{\infty} y\, q(y)\, dy \ge \lambda \int_{\lambda}^{\infty} q(y)\, dy = \lambda\, \mathbb{P}\{Y \ge \lambda\}.$$

From this inequality (7.1) follows directly. □
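
For a small illustration of inequality (7.1), with numbers chosen here and not taken from the text, let Y be the result of rolling a fair die once. Then $Y \ge 0$ and $\mathbb{E}Y = 7/2$, and for $\lambda = 5$ we get

$$\mathbb{P}\{Y \ge 5\} = \frac{2}{6} = \frac{1}{3} \le \frac{\mathbb{E}Y}{5} = \frac{7}{10}.$$

The estimate holds, but it is far from sharp. This is typical: inequality (7.1) uses nothing about Y except its expected value.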


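Remark 7.1.2. Sometimes it is useful to apply inequality (7.1) in a slightly modified
way. For example, if $Y \ge 0$ and $\alpha > 0$, then one derives

$$\mathbb{P}\{Y \ge \lambda\} = \mathbb{P}\{Y^{\alpha} \ge \lambda^{\alpha}\} \le \frac{\mathbb{E}\,Y^{\alpha}}{\lambda^{\alpha}}.$$

Or, if Y is real valued, then for $\lambda \in \mathbb{R}$ we obtain

$$\mathbb{P}\{Y \ge \lambda\} = \mathbb{P}\{e^{Y} \ge e^{\lambda}\} \le \frac{\mathbb{E}\,e^{Y}}{e^{\lambda}}.$$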

Now we are in a position to state and to prove Chebyshev’s inequality. 


Proposition 7.1.3 (Chebyshev’s inequality). Let X be a random variable with finite
second moment. Then, if c > 0, it follows that

$$\mathbb{P}\{|X - \mathbb{E}X| \ge c\} \le \frac{\mathbb{V}X}{c^2}. \tag{7.2}$$

Proof: Setting $Y := |X - \mathbb{E}X|^2$, we have $Y \ge 0$ and $\mathbb{E}Y = \mathbb{V}X$. Now apply inequality (7.1)
to Y with $\lambda = c^2$. This leads to

$$\mathbb{P}\{|X - \mathbb{E}X| \ge c\} = \mathbb{P}\{|X - \mathbb{E}X|^2 \ge c^2\} = \mathbb{P}\{Y \ge c^2\} \le \frac{\mathbb{E}Y}{c^2} = \frac{\mathbb{V}X}{c^2},$$

and estimate (7.2) is proven. □


Interpretation: Inequality (7.2) quantifies the interpretation of $\mathbb{V}X$ as a measure for the
dispersion of X. The smaller $\mathbb{V}X$, the less the values of X vary around its expected
value $\mathbb{E}X$.


Remark 7.1.4. Another way to formulate inequality (7.2) is as follows. If $\kappa > 0$, then

$$\mathbb{P}\{|X - \mathbb{E}X| \ge \kappa\, (\mathbb{V}X)^{1/2}\} \le \frac{1}{\kappa^2}.$$

To see this, apply inequality (7.2) with $c = \kappa\, (\mathbb{V}X)^{1/2}$.


Example 7.1.5. Roll a fair die n times. We are interested in the relative frequency of the
occurrence of the event A := {6}. Recall that this frequency was defined in eq. (1.1).
Moreover, we claimed there that $\lim_{n\to\infty} r_n(A) = \mathbb{P}(A) = \frac{1}{6}$. Is it possible to
estimate the probability for $|r_n(A) - \frac{1}{6}|$ being bigger than some given c > 0?

Answer: Define the random variable X as the absolute frequency of the occurrence
of A, that is, we have X = k for some k = 0, ..., n provided that A occurred exactly
k times. Then X is binomial distributed with parameters n and p = 1/6. To see this,
define “success” as appearance of “6.” Consequently, the relative frequency can be
represented as $r_n(A) = \frac{X}{n}$. An application of eqs. (5.7) and (5.33) gives

$$\mathbb{E}\, r_n(A) = \frac{1}{n}\, \mathbb{E}X = \frac{1}{6} \qquad\text{and}\qquad \mathbb{V}\, r_n(A) = \frac{1}{n^2}\, \mathbb{V}X = \frac{np(1-p)}{n^2} = \frac{5}{36n}.$$

Thus, inequality (7.2) leads to

$$\mathbb{P}\Big\{\Big|r_n(A) - \frac{1}{6}\Big| \ge c\Big\} \le \frac{5}{36c^2 n}.$$

If, for example, $n = 10^3$, and if we choose c = 1/36, then Chebyshev’s inequality yields

$$\mathbb{P}\Big\{\frac{5}{36} < r_{10^3}(A) < \frac{7}{36}\Big\} \ge 1 - \frac{9}{50} = 0.82.$$

For the absolute frequency this means

$$\mathbb{P}\{139 \le a_{10^3}(A) \le 194\} \ge 0.82.$$

Let us interpret the result. Suppose we roll a fair die 1000 times. Then, with a
probability of at least 82%, the frequency of “6” will be between 139 and 194.
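
Chebyshev's inequality only gives a lower bound here. Since the absolute frequency in Example 7.1.5 is binomial distributed with n = 1000 and p = 1/6, the probability of the event 139 ≤ X ≤ 194 can also be computed exactly. The following sketch does this and compares it with the bound; the function name `binomial_pmf` and its integer parametrization of p are choices made here for illustration.

```python
from math import comb

def binomial_pmf(n, k, p_num, p_den):
    """P{X = k} for X binomial with parameters n and p = p_num / p_den.

    Numerator and denominator are exact integers; Python's true division of
    large integers then yields a correctly rounded float.
    """
    q_num = p_den - p_num
    return comb(n, k) * p_num**k * q_num**(n - k) / p_den**n

n, c = 1000, 1 / 36
chebyshev_lower_bound = 1 - 5 / (36 * c**2 * n)   # 1 - 5/(36 c^2 n) for p = 1/6
exact = sum(binomial_pmf(n, k, 1, 6) for k in range(139, 195))  # P{139 <= X <= 194}

print("Chebyshev lower bound:", chebyshev_lower_bound)
print("exact binomial probability:", exact)
```

The exact probability is noticeably larger than the bound, which reflects that inequality (7.2) holds for every distribution with the given variance and is therefore rather conservative in a concrete case.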


Let us present a second quite similar example. 


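Example 7.1.6. Roll a fair die n times and let $S_n$ be the sum of the n results. Then
$S_n = X_1 + \cdots + X_n$, where $X_1, \ldots, X_n$ are uniformly distributed on {1, ..., 6} and independent.
By Example 5.2.17 we know that

$$\mathbb{E}S_n = \mathbb{E}X_1 + \cdots + \mathbb{E}X_n = \frac{7n}{2} \qquad\text{and}\qquad \mathbb{V}S_n = \mathbb{V}X_1 + \cdots + \mathbb{V}X_n = \frac{35n}{12},$$

hence

$$\mathbb{E}\Big(\frac{S_n}{n}\Big) = \frac{7}{2} \qquad\text{and}\qquad \mathbb{V}\Big(\frac{S_n}{n}\Big) = \frac{35}{12n}.$$

An application of inequality (7.2) leads then to

$$\mathbb{P}\Big\{\Big|\frac{S_n}{n} - \frac{7}{2}\Big| \ge c\Big\} \le \frac{35}{12nc^2}.$$

For example, if $n = 10^3$ and c is chosen as c = 0.1, then the right-hand side equals 35/120, so that

$$\mathbb{P}\Big\{3.4 < \frac{S_{10^3}}{10^3} < 3.6\Big\} \ge 1 - \frac{35}{120} \approx 0.708.$$

The interpretation of this result is as in the previous example. With a probability larger
than 70% the sum of 1000 rolls of a fair die will be a number between 3400 and 3600.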
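
The statement of Example 7.1.6 can be observed in a simulation. The sketch below repeats the experiment "roll a fair die 1000 times" many times and records how often the average lies strictly between 3.4 and 3.6. The number of repetitions and the seed are arbitrary assumptions made for this illustration.

```python
import random

def mean_of_rolls(rng, n):
    """Average of n independent rolls of a fair die."""
    return sum(rng.randint(1, 6) for _ in range(n)) / n

n, c = 1000, 0.1
bound = 1 - 35 / (12 * n * c**2)   # Chebyshev: P{|S_n/n - 7/2| < c} >= bound

rng = random.Random(7)             # fixed seed (arbitrary) for reproducibility
trials = 1000
hits = sum(abs(mean_of_rolls(rng, n) - 3.5) < c for _ in range(trials))

print("Chebyshev lower bound:", bound)
print("observed frequency:", hits / trials)
```

The observed frequency should exceed the Chebyshev bound, which again shows that the bound is valid but not sharp.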


7.1.2 *Infinite Sequences of Independent Random Variables 


Whenever one wants to describe the limit behavior of random variables or random
events, one needs a model for the infinite performance of random experiments. Otherwise,
we cannot investigate limits or other related quantities. This is comparable with
similar investigations in Calculus. In order to analyze limits, infinite sequences are
necessary, not finite ones. Thus, for the examination of limits of random variables we
need an infinite sequence $X_1, X_2, \ldots$ of random variables, which are, on one hand,
independent in the sense of Definition 4.3.4 and, on the other hand, possess some given
probability distributions.


Example 7.1.7. In order to describe the infinite tossing of a fair coin we need independent
random variables $X_1, X_2, \ldots$ such that $\mathbb{P}\{X_j = 0\} = \mathbb{P}\{X_j = 1\} = \frac{1}{2}$. Or, similarly,
for a model of rolling a die infinitely often we need infinitely many independent
random variables all uniformly distributed on {1, ..., 6}.

In Proposition 4.3.3 we presented the construction of independent $(X_j)_{j=1}^{\infty}$ distributed
according to $B_{1,1/2}$. This technique can be extended to more general sequences
of random variables, but it is quite complicated. Another, much smarter way is to use
so-called infinite product measures. Their existence follows by a deep theorem due
to A. N. Kolmogorov. As a consequence one gets the following result, which cannot
be proven within the framework of this book. We refer to [Kho07], Chapter 5, §2, for
further reading.



Proposition 7.1.8. Let $P_1, P_2, \ldots$ be arbitrary probability measures on $(\mathbb{R}, \mathcal{B}(\mathbb{R}))$. Then
there are a probability space $(\Omega, \mathcal{A}, \mathbb{P})$ and an infinite sequence of random variables
$X_j : \Omega \to \mathbb{R}$ such that the following holds.

1. The probability distribution of $X_j$ is $P_j$, $j = 1, 2, \ldots$. That is, for all $j \ge 1$ and all
   $B \in \mathcal{B}(\mathbb{R})$ it follows that

\[ \mathbb{P}\{X_j \in B\} = P_j(B) . \]

2. The random variables $X_1, X_2, \ldots$ are independent in the sense of Definition 4.3.4.
   This says, for all $n \ge 1$ and all $B_j \in \mathcal{B}(\mathbb{R})$ it follows that

\[ \mathbb{P}\{X_1 \in B_1, \ldots, X_n \in B_n\} = \mathbb{P}\{X_1 \in B_1\} \cdots \mathbb{P}\{X_n \in B_n\} = P_1(B_1) \cdots P_n(B_n) . \]


Of special interest is the case $P_1 = P_2 = \cdots = P_0$ for a certain probability measure $P_0$
on $\mathbb{R}$. Then the previous proposition implies the following.


Corollary 7.1.9. Given an arbitrary probability measure $P_0$ on $\mathcal{B}(\mathbb{R})$, there are random
variables $X_1, X_2, \ldots$ such that for all $n \ge 1$ and all $B_j \in \mathcal{B}(\mathbb{R})$

\[ \mathbb{P}\{X_1 \in B_1, \ldots, X_n \in B_n\} = P_0(B_1) \cdots P_0(B_n) . \]


Example 7.1.10. Choosing as $P_0$ the uniform distribution on $[0, 1]$, the previous corollary
ensures the existence of (independent) random variables $X_1, X_2, \ldots$ such that for
all $n \ge 1$ and all $0 \le a_j < b_j \le 1$

\[ \mathbb{P}\{a_1 \le X_1 \le b_1, \ldots, a_n \le X_n \le b_n\} = \prod_{j=1}^{n} (b_j - a_j) . \]

The sequence $X_1, X_2, \ldots$ models the independent choosing of infinitely many numbers
uniformly distributed in $[0, 1]$.


Remark 7.1.11. One may ask whether the kind of independence in Definition 4.3.4 suffices
for later purposes. Recall, we only require $X_1, \ldots, X_n$ to be independent for all
(finite) $n \ge 1$. Maybe one would expect a condition that involves the whole infinite sequence,
not only a finite part of it. The answer is that such a condition for the whole
sequence is a consequence of Definition 4.3.4. Namely, if $B_1, B_2, \ldots$ are Borel sets in
$\mathbb{R}$, then, by the continuity of probability measures from above, it follows that

\[ \mathbb{P}\{X_1 \in B_1, X_2 \in B_2, \ldots\} = \lim_{n\to\infty} \mathbb{P}\{X_1 \in B_1, \ldots, X_n \in B_n\} \]
\[ = \lim_{n\to\infty} \mathbb{P}\{X_1 \in B_1\} \cdots \mathbb{P}\{X_n \in B_n\} = \lim_{n\to\infty} \prod_{j=1}^{n} \mathbb{P}\{X_j \in B_j\} = \prod_{j=1}^{\infty} \mathbb{P}\{X_j \in B_j\} . \]

In particular, this implies

\[ \mathbb{P}\{a_1 \le X_1 \le b_1, a_2 \le X_2 \le b_2, \ldots\} = \prod_{j=1}^{\infty} \mathbb{P}\{a_j \le X_j \le b_j\} . \tag{7.3} \]


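Example 7.1.12. Let $X_1, X_2, \ldots$ be a sequence of independent $E_\lambda$-distributed random
variables for some $\lambda > 0$. Given real numbers $\alpha_j > 0$, we ask for the probability

\[ \mathbb{P}\{X_1 \le \alpha_1, X_2 \le \alpha_2, \ldots\} . \]

Answer: If we apply eq. (7.3) with $a_j = 0$ and with $b_j = \alpha_j$, then we get

\[ \mathbb{P}\{X_1 \le \alpha_1, X_2 \le \alpha_2, \ldots\} = \prod_{j=1}^{\infty} \mathbb{P}\{X_j \le \alpha_j\} = \prod_{j=1}^{\infty} \big[1 - e^{-\lambda\alpha_j}\big] . \]

Of special interest are sequences $(\alpha_j)_{j \ge 1}$ such that the infinite product converges, that
is, for these sequences $(\alpha_j)_{j \ge 1}$ we have $\prod_{j=1}^{\infty} [1 - e^{-\lambda\alpha_j}] > 0$. This happens if and only if

\[ \ln \prod_{j=1}^{\infty} \big[1 - e^{-\lambda\alpha_j}\big] = \sum_{j=1}^{\infty} \ln\big[1 - e^{-\lambda\alpha_j}\big] > -\infty . \tag{7.4} \]

Because of

\[ \lim_{x \to 0} \frac{\ln(1 - x)}{-x} = 1, \]

by the limit comparison test for infinite series, condition (7.4) holds if and only if

\[ \sum_{j=1}^{\infty} e^{-\lambda\alpha_j} < \infty . \]

If, for example, $\alpha_j = c \cdot \ln(j + 1)$ for some $c > 0$, then

\[ \sum_{j=1}^{\infty} e^{-\lambda\alpha_j} = \sum_{j=1}^{\infty} \frac{1}{(j+1)^{\lambda c}} . \]

This sum is known to be finite if and only if $\lambda c > 1$, that is, if $c > 1/\lambda$.
Another way to formulate this observation is as follows. It holds

\[ \mathbb{P}\Big\{\sup_{j \ge 1} \frac{X_j}{\ln(j+1)} \le c\Big\} = \mathbb{P}\{X_j \le c \ln(j+1),\ \forall j \ge 1\} = \prod_{j=1}^{\infty} \Big(1 - \frac{1}{(j+1)^{\lambda c}}\Big), \]

and this probability is positive if and only if $c > 1/\lambda$.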
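
The dichotomy at $\lambda c = 1$ can be observed numerically on the partial products. The sketch below is an added illustration; the function name `partial_product` is an assumption, and the partial products are accumulated through logarithms for numerical stability. It evaluates $\prod_{j=1}^{N} (1 - (j+1)^{-\lambda c})$. For $\lambda c = 1$ the product telescopes to $1/(N+1)$ and tends to zero, while for $\lambda c = 2$ it telescopes to $(N+2)/(2(N+1))$ and tends to $1/2$.

```python
import math


def partial_product(lam, c, n_terms):
    """prod_{j=1}^{n_terms} (1 - (j+1)^(-lam*c)), accumulated via logarithms."""
    log_sum = 0.0
    for j in range(1, n_terms + 1):
        log_sum += math.log1p(-((j + 1) ** (-lam * c)))
    return math.exp(log_sum)


if __name__ == "__main__":
    for n_terms in (10, 1000, 100000):
        print(n_terms,
              partial_product(1.0, 1.0, n_terms),   # lam*c = 1: tends to 0
              partial_product(1.0, 2.0, n_terms))   # lam*c = 2: tends to 1/2
```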


7.1.3 *Borel-Cantelli Lemma


The aim of this section is to present one of the most useful tools for the investigation of
the limit behavior of infinite sequences of random variables and events. Let $(\Omega, \mathcal{A}, \mathbb{P})$
be a probability space and let $A_1, A_2, \ldots$ be a sequence of events in $\mathcal{A}$. Then two typical
questions arise. What is the probability that there exists some $n \in \mathbb{N}$ such that all
events $A_m$ with $m \ge n$ occur? The other related question asks for the probability that
infinitely many of the events $A_n$ occur.

To explain why these questions are of interest, let us once more regard Example
4.1.7 of the random walk. Here $S_n$ denotes the integer where the particle is located after
$n$ random jumps. For example, letting $A_n := \{\omega \in \Omega : S_n(\omega) > 0\}$, then the existence
of an $n \in \mathbb{N}$, such that $A_m$ occurs for all $m \ge n$, says that the particle from a certain
(random) moment attains only positive numbers and never goes back to the negative
ones. Or, if we investigate the events $B_n := \{\omega \in \Omega : S_n(\omega) = 0\}$, then the $B_n$s occur
infinitely often if and only if the particle returns to zero infinitely often. Equivalently,
there are (random) $n_1 < n_2 < \cdots$ with $S_{n_j}(\omega) = 0$.

To formulate the two previous questions more precisely, let us introduce the 
following two events. 


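Definition 7.1.13. Let $A_1, A_2, \ldots$ be subsets of $\Omega$. Then

\[ \liminf_{n\to\infty} A_n := \bigcup_{n=1}^{\infty} \bigcap_{m=n}^{\infty} A_m \quad\text{and}\quad \limsup_{n\to\infty} A_n := \bigcap_{n=1}^{\infty} \bigcup_{m=n}^{\infty} A_m \]

are called the lower and the upper limit of the $A_n$s.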


Remark 7.1.14. Let us characterize when the lower and the upper limit occur.

1. An element $\omega \in \Omega$ belongs to $\liminf_{n\to\infty} A_n$ if and only if there is an $n \in \mathbb{N}$ such
   that $\omega \in \bigcap_{m=n}^{\infty} A_m$, that is, if it is an element of $A_m$ for all $m \ge n$. In other words, the
   lower limit occurs if there is an¹ $n \in \mathbb{N}$ such that after $n$ the events $A_m$ always occur.
   Therefore, we say that $\liminf_{n\to\infty} A_n$ occurs if the $A_n$s finally always occur. Thus,

\[ \mathbb{P}\{\omega \in \Omega : \exists\, n \text{ s.t. } \omega \in A_m,\ m \ge n\} = \mathbb{P}\big(\liminf_{n\to\infty} A_n\big) . \]

2. An element $\omega \in \Omega$ belongs to $\limsup_{n\to\infty} A_n$ if and only if for each $n \in \mathbb{N}$ there is
   an $m \ge n$ such that $\omega \in A_m$. But this is nothing else than to say that the number of
   $A_n$s with $\omega \in A_n$ is infinite. Therefore, the upper limit consists of those elements
   for which we have infinitely often $\omega \in A_n$. Note that also these events may be
   different for different $\omega$s. Thus,

\[ \mathbb{P}\{\omega \in \Omega : \omega \in A_n \text{ for infinitely many } n\} = \mathbb{P}\big(\limsup_{n\to\infty} A_n\big) . \]

¹ Note that this $n$ is random, that is, it may depend on the chosen $\omega \in \Omega$.

Example 7.1.15. Suppose a fair coin is labeled on one side with “0” and on the
other side with “1.” We toss it infinitely often. Let $A_n$ occur if the $n$th toss is “1.”
Then $\liminf_{n\to\infty} A_n$ occurs if after a certain number of tosses “1” always shows up.
On the other hand, $\limsup_{n\to\infty} A_n$ occurs if and only if the number “1” appears
infinitely often.


Let us formulate and prove some easy properties of the lower and upper limit. 


Proposition 7.1.16. If $A_1, A_2, \ldots$ are subsets of $\Omega$, then

\[ (1) \quad \liminf_{n\to\infty} A_n \subseteq \limsup_{n\to\infty} A_n, \]

\[ (2) \quad \big(\limsup_{n\to\infty} A_n\big)^c = \liminf_{n\to\infty} A_n^c \quad\text{and}\quad \big(\liminf_{n\to\infty} A_n\big)^c = \limsup_{n\to\infty} A_n^c . \]

Proof: We prove these properties in the interpretation of the lower and upper limit
given in Remark 7.1.14.

Suppose that $\omega \in \liminf_{n\to\infty} A_n$. Then for some $n \ge 1$ it follows that $\omega \in A_m$,
$m \ge n$. Of course, then the number of events with $\omega \in A_n$ is infinite, which implies
$\omega \in \limsup_{n\to\infty} A_n$. This proves (1).

Observe that we have $\omega \notin \limsup_{n\to\infty} A_n$ if and only if $\omega \in A_n$ for only finitely
many $n \in \mathbb{N}$. Equivalently, there is an $n \ge 1$ such that whenever $m \ge n$, then $\omega \notin A_m$, or,
that $\omega \in A_m^c$. In other words, this happens if and only if $\omega \in \liminf_{n\to\infty} A_n^c$. This proves
the left-hand identity in (2). The second one follows by the same arguments. One may
also prove this by applying the left-hand identity with $A_n^c$. □


Before we can formulate the main result in this section, we have to define when an
infinite sequence of events is independent.


Definition 7.1.17. A sequence of events $A_1, A_2, \ldots$ in $\mathcal{A}$ is said to be independent
provided that for all $n \ge 1$ the events $A_1, \ldots, A_n$ are independent in the sense of
Definition 2.2.12.


Remark 7.1.18. Using the method for the proof of eq. (7.3) one may deduce the
following “infinite” version of independence. For independent $A_1, A_2, \ldots$ it follows that

\[ \mathbb{P}\Big(\bigcap_{n=1}^{\infty} A_n\Big) = \prod_{n=1}^{\infty} \mathbb{P}(A_n) . \]


Remark 7.1.19. According to Proposition 3.6.7, the independence of random variables
and that of events are linked as follows.

The random variables $X_1, X_2, \ldots$ are independent in the sense of Definition 4.3.4
if and only if for all Borel sets $B_1, B_2, \ldots$ in $\mathbb{R}$ the preimages $X_1^{-1}(B_1), X_2^{-1}(B_2), \ldots$ are
independent events as introduced in Definition 7.1.17.


Now we are in the position to state and prove the main result of this section. 


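Proposition 7.1.20 (Borel-Cantelli lemma). Let $(\Omega, \mathcal{A}, \mathbb{P})$ be a probability space and let
$A_n \in \mathcal{A}$, $n = 1, 2, \ldots$.

1. If $\sum_{n=1}^{\infty} \mathbb{P}(A_n) < \infty$, then this implies

\[ \mathbb{P}\big(\limsup_{n\to\infty} A_n\big) = 0 . \tag{7.5} \]

2. For independent $A_1, A_2, \ldots$ the following is valid. If $\sum_{n=1}^{\infty} \mathbb{P}(A_n) = \infty$, then

\[ \mathbb{P}\big(\limsup_{n\to\infty} A_n\big) = 1 . \]


Proof: We start with proving the first assertion. Thus, take arbitrary subsets $A_n \in \mathcal{A}$
satisfying $\sum_{n=1}^{\infty} \mathbb{P}(A_n) < \infty$. Write

\[ \limsup_{n\to\infty} A_n = \bigcap_{n=1}^{\infty} B_n \]

with $B_n := \bigcup_{m=n}^{\infty} A_m$. Since $B_1 \supseteq B_2 \supseteq \cdots$, property (7) in Proposition 1.2.1 applies, and
together with (5) in the same proposition this leads to

\[ \mathbb{P}\big(\limsup_{n\to\infty} A_n\big) = \lim_{n\to\infty} \mathbb{P}(B_n) \le \liminf_{n\to\infty} \sum_{m=n}^{\infty} \mathbb{P}(A_m) . \tag{7.6} \]

If $\alpha_1, \alpha_2, \ldots$ are non-negative numbers with $\sum_{n=1}^{\infty} \alpha_n < \infty$, then it is known that
$\sum_{m=n}^{\infty} \alpha_m \to 0$ as $n \to \infty$. Applying this observation to $\alpha_n = \mathbb{P}(A_n)$, assertion (7.5) is
a direct consequence of estimate (7.6). Thus, the first part is proven.

To prove the second assertion we investigate the probability of the complementary
event. Here we have

\[ \big(\limsup_{n\to\infty} A_n\big)^c = \bigcup_{n=1}^{\infty} \bigcap_{m=n}^{\infty} A_m^c . \]

An application of (5) in Proposition 1.2.1 implies

\[ \mathbb{P}\Big(\big(\limsup_{n\to\infty} A_n\big)^c\Big) \le \sum_{n=1}^{\infty} \mathbb{P}\Big(\bigcap_{m=n}^{\infty} A_m^c\Big) . \tag{7.7} \]

Fix $n \in \mathbb{N}$ and for $k \ge n$ set $B_k := \bigcap_{m=n}^{k} A_m^c$. Then $B_n \supseteq B_{n+1} \supseteq \cdots$, hence by property (7)
in Proposition 1.2.1 it follows that

\[ \mathbb{P}\Big(\bigcap_{m=n}^{\infty} A_m^c\Big) = \mathbb{P}\Big(\bigcap_{k=n}^{\infty} B_k\Big) = \lim_{k\to\infty} \mathbb{P}(B_k) = \lim_{k\to\infty} \prod_{m=n}^{k} \big(1 - \mathbb{P}(A_m)\big) . \]

Here we used in the last step that, according to Proposition 2.2.15, the events
$A_1^c, A_2^c, \ldots$ are independent as well. Next we apply the elementary inequality

\[ 1 - x \le e^{-x}, \quad 0 \le x \le 1, \]

for $x = \mathbb{P}(A_m)$, and because of $\sum_{m=n}^{\infty} \mathbb{P}(A_m) = \infty$ we arrive at

\[ \mathbb{P}\Big(\bigcap_{m=n}^{\infty} A_m^c\Big) \le \limsup_{k\to\infty} \exp\Big(-\sum_{m=n}^{k} \mathbb{P}(A_m)\Big) = 0 . \]

Plugging this into estimate (7.7) finally implies

\[ \mathbb{P}\Big(\big(\limsup_{n\to\infty} A_n\big)^c\Big) = 0, \quad\text{hence}\quad \mathbb{P}\big(\limsup_{n\to\infty} A_n\big) = 1 \]

as asserted. □
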
Remark 7.1.21. The second assertion in Proposition 7.1.20 remains valid under the
weaker condition of pairwise independence. But then the proof becomes more
complicated.


Corollary 7.1.22. Let $A_n \in \mathcal{A}$ be independent events. Then the following equivalences hold.

\[ \mathbb{P}\big(\limsup_{n\to\infty} A_n\big) = 0 \iff \sum_{n=1}^{\infty} \mathbb{P}(A_n) < \infty \]

\[ \mathbb{P}\big(\limsup_{n\to\infty} A_n\big) = 1 \iff \sum_{n=1}^{\infty} \mathbb{P}(A_n) = \infty \]


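Example 7.1.23. Let $(U_n)_{n \ge 1}$ be a sequence of independent random variables, uniformly
distributed on $[0, 1]$. Given positive real numbers $(\alpha_n)_{n \ge 1}$, we define events $A_n$
by setting $A_n := \{U_n \le \alpha_n\}$. Since the $U_n$s are independent, so are the events $A_n$, and
Corollary 7.1.22 applies. Because of $\mathbb{P}(A_n) = \alpha_n$ this leads to

\[ \mathbb{P}\{U_n \le \alpha_n \text{ infinitely often}\} = \begin{cases} 0 & : \ \sum_{n=1}^{\infty} \alpha_n < \infty \\ 1 & : \ \sum_{n=1}^{\infty} \alpha_n = \infty \end{cases} \]

or, equivalently, to

\[ \mathbb{P}\{U_n > \alpha_n \text{ finally always}\} = \begin{cases} 0 & : \ \sum_{n=1}^{\infty} \alpha_n = \infty \\ 1 & : \ \sum_{n=1}^{\infty} \alpha_n < \infty . \end{cases} \]

For example, we have

\[ \mathbb{P}\{U_n \le 1/n \text{ infinitely often}\} = 1 \quad\text{and}\quad \mathbb{P}\{U_n \le 1/n^2 \text{ infinitely often}\} = 0 . \]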
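
A finite simulation cannot prove a statement about "infinitely often," but it can make the two cases of Example 7.1.23 plausible. The following sketch is an added illustration with assumed names (`expected_count`, `observed_count`). It compares the partial sums $\sum_{n \le N} \mathbb{P}(A_n)$ for $\alpha_n = 1/n$ (growing like $\ln N$) and $\alpha_n = 1/n^2$ (bounded by $\pi^2/6$), and counts how often $U_n \le \alpha_n$ happens in one simulated sequence.

```python
import math
import random


def expected_count(alpha, n_max):
    """Sum of P(A_n) = min(alpha(n), 1) for n = 1, ..., n_max."""
    return sum(min(alpha(n), 1.0) for n in range(1, n_max + 1))


def observed_count(alpha, n_max, seed=0):
    """Number of n <= n_max with U_n <= alpha(n) in one simulated sequence."""
    rng = random.Random(seed)
    return sum(1 for n in range(1, n_max + 1) if rng.random() <= alpha(n))


if __name__ == "__main__":
    n_max = 10**5
    for name, alpha in (("1/n", lambda n: 1 / n), ("1/n^2", lambda n: 1 / n**2)):
        print(name,
              "expected:", round(expected_count(alpha, n_max), 3),
              "observed:", observed_count(alpha, n_max))
    print("ln(n_max) =", round(math.log(n_max), 3),
          " pi^2/6 =", round(math.pi**2 / 6, 3))
```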
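Example 7.1.24. Let $(X_n)_{n \ge 1}$ be a sequence of independent $\mathcal{N}(0, 1)$-distributed random
variables and let $c_n > 0$. What is the probability of observing the event
$\{|X_n| \ge c_n\}$ infinitely often?

Answer: It holds

\[ \sum_{n=1}^{\infty} \mathbb{P}\{|X_n| \ge c_n\} = \frac{2}{\sqrt{2\pi}} \sum_{n=1}^{\infty} \int_{c_n}^{\infty} e^{-x^2/2}\, dx = \frac{2}{\sqrt{2\pi}} \sum_{n=1}^{\infty} \varphi(c_n), \]

where

\[ \varphi(t) := \int_{t}^{\infty} e^{-x^2/2}\, dx, \quad t \ge 0 . \]

Setting $\psi(t) := t^{-1} e^{-t^2/2}$, $t > 0$, then

\[ \varphi'(t) = -e^{-t^2/2} \quad\text{and}\quad \psi'(t) = -\Big(1 + \frac{1}{t^2}\Big) e^{-t^2/2}, \]

hence l’Hôpital’s rule implies

\[ \lim_{t\to\infty} \frac{\varphi(t)}{\psi(t)} = \lim_{t\to\infty} \frac{\varphi'(t)}{\psi'(t)} = \lim_{t\to\infty} \frac{1}{1 + 1/t^2} = 1 . \]

The limit comparison test for infinite series tells us that $\sum_{n=1}^{\infty} \varphi(c_n) < \infty$ if and only if
$\sum_{n=1}^{\infty} \psi(c_n) < \infty$. Thus, by the definition of $\psi$ the following are equivalent.

\[ \sum_{n=1}^{\infty} \mathbb{P}\{|X_n| \ge c_n\} < \infty \iff \sum_{n=1}^{\infty} \frac{e^{-c_n^2/2}}{c_n} < \infty . \]

In other words, we have

\[ \mathbb{P}\{|X_n| \ge c_n \text{ infinitely often}\} = \begin{cases} 0 & : \ \sum_{n=1}^{\infty} \frac{e^{-c_n^2/2}}{c_n} < \infty \\ 1 & : \ \sum_{n=1}^{\infty} \frac{e^{-c_n^2/2}}{c_n} = \infty . \end{cases} \]

For example, if $c_n = c\sqrt{\ln n}$ for some $c > 0$ (and $n \ge 2$), then

\[ \sum_{n=2}^{\infty} \frac{e^{-c_n^2/2}}{c_n} = \frac{1}{c} \sum_{n=2}^{\infty} \frac{1}{n^{c^2/2} \sqrt{\ln n}} < \infty \]

if and only if $c > \sqrt{2}$. In particular, this yields the following interesting fact:

\[ \mathbb{P}\{|X_n| \ge \sqrt{2 \ln n} \text{ infinitely often}\} = 1, \]

while for each $c > 2$

\[ \mathbb{P}\{|X_n| \ge \sqrt{c \ln n} \text{ infinitely often}\} = 0 . \]

From this we derive

\[ \mathbb{P}\Big\{\omega \in \Omega : \limsup_{n\to\infty} \frac{|X_n(\omega)|}{\sqrt{\ln n}} = \sqrt{2}\Big\} = 1 . \]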
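
The asymptotic equivalence of $\varphi$ and $\psi$ used in Example 7.1.24 can be checked numerically. The sketch below is an added illustration; it assumes the standard identity $\int_t^\infty e^{-x^2/2}\,dx = \sqrt{\pi/2}\,\operatorname{erfc}(t/\sqrt{2})$, which is not stated in the text, and evaluates the ratio $\varphi(t)/\psi(t)$ for growing $t$.

```python
import math


def phi(t):
    """Tail integral of exp(-x^2/2) over [t, infinity), via erfc."""
    return math.sqrt(math.pi / 2) * math.erfc(t / math.sqrt(2))


def psi(t):
    """Comparison function exp(-t^2/2) / t for t > 0."""
    return math.exp(-t * t / 2) / t


if __name__ == "__main__":
    for t in (1.0, 2.0, 5.0, 10.0, 20.0):
        print(t, phi(t) / psi(t))   # the ratio increases towards 1
```

The ratio stays below 1 and approaches it from below, which is what the limit comparison test requires.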


Example 7.1.25. In a lottery 6 of 49 numbers are randomly chosen. Find the probability
to have infinitely often the six chosen numbers on your lottery ticket.

Answer: Let $A_n$ be the event to have in the $n$th drawing the six chosen numbers on
the ticket. We saw (compare Example 1.4.3) that

\[ \mathbb{P}(A_n) = \frac{1}{\binom{49}{6}} =: \delta > 0 . \]

Consequently, it follows that $\sum_{n=1}^{\infty} \mathbb{P}(A_n) = \infty$, and since the $A_n$s are independent,
Proposition 7.1.20 implies

\[ \mathbb{P}\{A_n \text{ infinitely often}\} = 1 . \]

Therefore, the event to win infinitely often has probability 1. One just does not play
long enough!
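
To get a feeling for how long "long enough" is, one may evaluate $\delta$ and the probability of at least one win in $n$ independent drawings, $1 - (1 - \delta)^n$. The following sketch is an added illustration; the names `TICKETS`, `DELTA` and `prob_at_least_one_win` are assumptions, and the formula uses only the independence of the drawings.

```python
import math

TICKETS = math.comb(49, 6)   # number of possible tickets, 13983816
DELTA = 1 / TICKETS          # probability of six correct numbers in one drawing


def prob_at_least_one_win(n_draws):
    """1 - (1 - DELTA)^n_draws, evaluated stably for very small DELTA."""
    return -math.expm1(n_draws * math.log1p(-DELTA))


if __name__ == "__main__":
    print("delta =", DELTA)
    for n_draws in (52, 52 * 100, TICKETS, 10 * TICKETS):
        print(n_draws, prob_at_least_one_win(n_draws))
```

The expected waiting time for the first win is $1/\delta = 13\,983\,816$ drawings, which with one drawing per week amounts to roughly 269,000 years.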


Remark 7.1.26. Corollary 7.1.22 shows particularly that for independent $A_n$s either

\[ \mathbb{P}\big(\limsup_{n\to\infty} A_n\big) = 0 \quad\text{or}\quad \mathbb{P}\big(\limsup_{n\to\infty} A_n\big) = 1 . \]

Because of Proposition 7.1.16 the same is valid for the lower limit. So-called 0-1 laws
are at work here, which, roughly speaking, assert the following. Whenever the occurrence
or nonoccurrence of an event is independent of the first finitely many results, then
such an event occurs either with probability 0 or 1. For example, the occurrence or
nonoccurrence of the lower or upper limit is completely independent of what happened
during the first $n$ results, $n \ge 1$.


7.1.4 Weak Law of Large Numbers 


Given random variables $X_1, X_2, \ldots$ let

\[ S_n := X_1 + \cdots + X_n \tag{7.8} \]

be the sum of the first $n$ values. One of the most important questions in Probability
Theory is that about the behavior of $S_n$ as $n \to \infty$. Suppose we play a series of games
and $X_j$ denotes the loss or the gain in game $j \ge 1$. Then $S_n$ is nothing else than the total
loss or gain after $n$ games. Also recall the random walk presented in Example 4.1.7. Set
$X_j = -1$ if in step $j$ the particle jumps to the left and $X_j = 1$ otherwise. Then $S_n$ is the
point in $\mathbb{Z}$ where the particle is located after $n$ jumps.

Let us come back to the general case. We are given arbitrary independent and
identically distributed random variables $X_1, X_2, \ldots$. Recall that “identically distributed”
says that they all possess the same probability distribution. Set $S_n = X_1 + \cdots + X_n$. A
first result gives some information about the behavior of the arithmetic mean $S_n/n$
as $n \to \infty$.


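Proposition 7.1.27 (Weak law of large numbers). Let $X_1, X_2, \ldots$ be independent
identically distributed random variables with (common) expected value $\mu \in \mathbb{R}$. If $\varepsilon > 0$,
then it follows that

\[ \lim_{n\to\infty} \mathbb{P}\Big\{\Big|\frac{S_n}{n} - \mu\Big| \ge \varepsilon\Big\} = 0 . \]


Proof: We prove the result only under an additional condition, namely that $X_1$, and
hence all $X_j$, possess a finite second moment. The result remains true without this
condition, but then its proof becomes significantly more complicated.

From (3) in Proposition 5.1.36 we derive

\[ \mathbb{E}\Big(\frac{S_n}{n}\Big) = \frac{\mathbb{E}S_n}{n} = \frac{\mathbb{E}(X_1 + \cdots + X_n)}{n} = \frac{\mathbb{E}X_1 + \cdots + \mathbb{E}X_n}{n} = \frac{n\mu}{n} = \mu . \]

Furthermore, by the independence of the $X_j$s, property (iv) in Proposition 5.2.15 also
gives

\[ \mathbb{V}\Big(\frac{S_n}{n}\Big) = \frac{\mathbb{V}S_n}{n^2} = \frac{\mathbb{V}X_1 + \cdots + \mathbb{V}X_n}{n^2} = \frac{\mathbb{V}X_1}{n} . \]

Consequently, inequality (7.2) implies

\[ \mathbb{P}\Big\{\Big|\frac{S_n}{n} - \mu\Big| \ge \varepsilon\Big\} \le \frac{\mathbb{V}(S_n/n)}{\varepsilon^2} = \frac{\mathbb{V}X_1}{n\,\varepsilon^2}, \]

and the desired assertion follows by

\[ \limsup_{n\to\infty} \mathbb{P}\Big\{\Big|\frac{S_n}{n} - \mu\Big| \ge \varepsilon\Big\} \le \lim_{n\to\infty} \frac{\mathbb{V}X_1}{n\,\varepsilon^2} = 0 . \]

□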


Remark 7.1.28. The type of convergence appearing in Proposition 7.1.27 is usually
called convergence in probability. More precisely, given random variables
$Y_1, Y_2, \ldots$, they converge in probability to some random variable $Y$ provided that for
each $\varepsilon > 0$

\[ \lim_{n\to\infty} \mathbb{P}\{|Y_n - Y| \ge \varepsilon\} = 0 . \]

Hence, in this language the weak law of large numbers asserts that $S_n/n$ converges in
probability to a random variable $Y$, which is the constant $\mu$.


Interpretation of Proposition 7.1.27: Fix $\varepsilon > 0$ and define events $A_n$, $n \ge 1$, by

\[ A_n := \Big\{\omega \in \Omega : \Big|\frac{S_n(\omega)}{n} - \mu\Big| \le \varepsilon\Big\} . \]

Then Proposition 7.1.27 implies $\lim_{n\to\infty} \mathbb{P}(A_n) = 1$. Hence, given $\delta > 0$, then there is an
$n_0 = n_0(\varepsilon, \delta)$ such that $\mathbb{P}(A_n) \ge 1 - \delta$ whenever $n \ge n_0$. In other words, if $n$ is sufficiently
large, then with high probability (recall, $\mu$ is the expected value of the $X_j$s)

\[ \mu - \varepsilon \le \frac{S_n}{n} = \frac{1}{n} \sum_{j=1}^{n} X_j \le \mu + \varepsilon . \]

This confirms once more the interpretation of the expected value as (approximate)
arithmetic mean of the observed values, provided that we execute the same
experiment arbitrarily often and the results do not depend on each other.
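
For tossing a fair coin the probabilities in the weak law can be computed exactly, because $S_n$ is $B_{n,1/2}$-distributed. The following sketch is an added illustration with assumed function names. It sums the binomial probabilities of all $k$ with $|k/n - p| \ge \varepsilon$ and compares the result with the bound $\mathbb{V}X_1/(n\varepsilon^2) = p(1-p)/(n\varepsilon^2)$ obtained from inequality (7.2). The small tolerance in the comparison is a guard against floating-point rounding, not part of the mathematics.

```python
import math


def exact_deviation_probability(n, eps, p=0.5):
    """P{|S_n/n - p| >= eps} for a B_{n,p}-distributed S_n (exact sum)."""
    total = 0.0
    for k in range(n + 1):
        # tolerance guards against rounding errors in k/n - p
        if abs(k / n - p) >= eps - 1e-12:
            log_pmf = (math.lgamma(n + 1) - math.lgamma(k + 1)
                       - math.lgamma(n - k + 1)
                       + k * math.log(p) + (n - k) * math.log(1 - p))
            total += math.exp(log_pmf)
    return total


def chebyshev_bound(n, eps, p=0.5):
    """Upper bound p(1-p)/(n eps^2) from inequality (7.2), capped at 1."""
    return min(1.0, p * (1 - p) / (n * eps * eps))


if __name__ == "__main__":
    eps = 0.05
    for n in (10, 100, 1000, 10000):
        print(n, exact_deviation_probability(n, eps), chebyshev_bound(n, eps))
```

Both columns tend to zero, and the exact probability does so much faster than the bound, which only decays like $1/n$.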


7.1.5 Strong Law of Large Numbers 


Proposition 7.1.27 does not imply $S_n/n \to \mu$ in the usual sense. It only asserts the
convergence of $S_n/n$ in probability, which, in general, does not imply a pointwise
convergence. The following theorem shows that, nevertheless, a strong type of convergence
takes place. The proof of this result is much more complicated than that of
Proposition 7.1.27. Therefore, we cannot present it in the scope of this book, and we
refer to [Dur10], Section 2.4, for a proof.


Proposition 7.1.29 (Strong law of large numbers). Let $X_1, X_2, \ldots$ be a sequence of
independent identically distributed random variables with expected value $\mu = \mathbb{E}X_1$. If $S_n$
is defined by eq. (7.8), then

\[ \mathbb{P}\Big\{\omega \in \Omega : \lim_{n\to\infty} \frac{S_n(\omega)}{n} = \mu\Big\} = 1 . \]


Remark 7.1.30. Given random variables $Y_1, Y_2, \ldots$ and $Y$, one says that the $Y_n$s
converge to $Y$ almost surely, if

\[ \mathbb{P}\big\{\lim_{n\to\infty} Y_n = Y\big\} = \mathbb{P}\big\{\omega \in \Omega : \lim_{n\to\infty} Y_n(\omega) = Y(\omega)\big\} = 1 . \]

Thus, Proposition 7.1.29 asserts that $S_n/n$ converges almost surely to a random variable
$Y$, which is the constant $\mu$.


Remark 7.1.31. Proposition 7.1.29 allows the following interpretation. There exists a
subset $\Omega_0$ in the sample space $\Omega$ with $\mathbb{P}(\Omega_0) = 1$ such that for all $\omega \in \Omega_0$ and all $\varepsilon > 0$,
there is an $n_0 = n_0(\varepsilon, \omega)$ with

\[ \Big|\frac{S_n(\omega)}{n} - \mu\Big| < \varepsilon \]

whenever $n \ge n_0$.

In other words, with probability one the following happens: given $\varepsilon > 0$, then
there is a certain $n_0$ depending on $\omega$, hence being random, such that for $n \ge n_0$ the
arithmetic mean $S_n/n$ is in an $\varepsilon$-neighborhood of $\mu$ and never leaves it again.

Let us emphasize once more that $S_n/n$ is random, hence $S_n/n$ may attain different
values for a different series of experiments. Nevertheless, starting from a certain point,
which may be different for different experiments, the arithmetic mean of the first $n$
results will be in $(\mu - \varepsilon, \mu + \varepsilon)$.

When we introduced probability measures in Section 1.1.3, we claimed that the
number P(A) may be regarded as limit of the relative frequencies of the occurrence of 
A. As a first consequence of Proposition 7.1.29 we show that this is indeed true. 


Proposition 7.1.32. Suppose a random experiment is described by a probability space
$(\Omega, \mathcal{A}, \mathbb{P})$. Execute this experiment arbitrarily often. Given an event $A \in \mathcal{A}$, let $r_n(A)$ be
the relative frequency of $A$ in $n$ trials as defined in eq. (1.1). Then almost surely

\[ \lim_{n\to\infty} r_n(A) = \mathbb{P}(A) . \]


Proof: Define random variables $X_1, X_2, \ldots$ as follows. Set $X_j = 1$ if $A$ occurs in trial
$j$, while $X_j = 0$ otherwise. Since the experiments are executed independently of each
other, the $X_j$s are independent as well. Moreover, we execute every time exactly the
same experiment, hence the $X_j$s are also identically distributed.

By the definition of the $X_j$s,

\[ \frac{S_n}{n} = r_n(A) . \]

Thus, it remains to evaluate $\mu = \mathbb{E}X_1$. To this end observe that the $X_j$s are $B_{1,p}$-distributed
with success probability $p = \mathbb{P}(A)$. Recall that $X_j = 1$ if and only if $A$ occurs
in experiment $j$, and since the experiment is described by $(\Omega, \mathcal{A}, \mathbb{P})$, the probability for
$X_j$ being one is $\mathbb{P}(A)$. Consequently, $\mathbb{E}X_1 = \mathbb{P}(A)$.

Proposition 7.1.29 now implies that almost surely

\[ \lim_{n\to\infty} r_n(A) = \lim_{n\to\infty} \frac{S_n}{n} = \mathbb{E}X_1 = \mathbb{P}(A) . \]

This completes the proof. □
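
The convergence of relative frequencies and the random index $n_0(\varepsilon, \omega)$ from Remark 7.1.31 can be illustrated by simulation. In the sketch below, which is an added example with assumed names, $A$ is the event that a fair die shows “6,” so that $\mathbb{P}(A) = 1/6$. The function follows one series of $n$ rolls and records the last index $k \le n$ at which $r_k(A)$ lies outside the $\varepsilon$-neighborhood of $1/6$. A finite run can of course only suggest, not prove, that the path never leaves the neighborhood again.

```python
import random


def relative_frequency_path(n, eps, seed=0):
    """Simulate n rolls of a fair die; A = {a six is rolled}.

    Returns the final relative frequency r_n(A) and the last index k <= n
    with |r_k(A) - 1/6| >= eps (0 if this never happens).
    """
    rng = random.Random(seed)
    p = 1 / 6
    count = 0
    last_exit = 0
    for k in range(1, n + 1):
        if rng.randint(1, 6) == 6:
            count += 1
        if abs(count / k - p) >= eps:
            last_exit = k
    return count / n, last_exit


if __name__ == "__main__":
    for seed in range(5):   # five different series of experiments
        freq, last_exit = relative_frequency_path(100000, 0.02, seed)
        print("seed", seed, "r_n(A) =", round(freq, 4), "last exit at", last_exit)
```

Different seeds play the role of different $\omega$: the last exit index varies from run to run, while the final relative frequency is always close to $1/6$.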


What happens in the case that the $X_j$s do not possess an expected value? Does
$S_n/n$ converge nevertheless? If this is so, could we take this limit as a “generalized”
expected value? The next proposition shows that such an approach does not work.


Proposition 7.1.33. Let $X_1, X_2, \ldots$ be independent and identically distributed with
$\mathbb{E}|X_1| = \infty$. Then it follows that

\[ \mathbb{P}\Big\{\omega \in \Omega : \frac{S_n(\omega)}{n} \text{ diverges}\Big\} = 1 . \]

For example, if we take an independent sequence $(X_j)_{j \ge 1}$ of Cauchy distributed random
variables, then their arithmetic means $S_n/n$ will diverge almost surely.
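
The failure of averaging for Cauchy distributed random variables can be made visible by simulation. The sketch below is an added illustration and relies on a known fact not proved in this text: the arithmetic mean of $n$ independent standard Cauchy variables is again standard Cauchy distributed, so $\mathbb{P}\{|S_n/n| > 1\} = 1/2$ for every $n$. The sampling formula $\tan(\pi(U - 1/2))$ with $U$ uniform on $[0,1]$ and all function names are assumptions of this example.

```python
import math
import random


def cauchy_sample(rng):
    """One standard Cauchy variate via the inverse distribution function."""
    return math.tan(math.pi * (rng.random() - 0.5))


def fraction_mean_exceeds_one(n, trials, seed=0):
    """Fraction of trials in which |S_n/n| > 1 for n Cauchy summands."""
    rng = random.Random(seed)
    count = 0
    for _ in range(trials):
        mean = sum(cauchy_sample(rng) for _ in range(n)) / n
        if abs(mean) > 1:
            count += 1
    return count / trials


if __name__ == "__main__":
    for n in (1, 10, 100):
        print(n, fraction_mean_exceeds_one(n, trials=2000))
```

In contrast to the situation of Proposition 7.1.27, the printed fractions do not decrease as $n$ grows: averaging does not concentrate the values around any number.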
Remark 7.1.34. Why does one need a weak law of large numbers when there exists
a strong one? This question is justified, and in fact, in the situation described in
this book the weak law is a consequence of the strong one, thus, it is not necessarily
needed.

The situation is different if one investigates independent, but not necessarily
identically distributed, random variables. Then there are sequences $X_1, X_2, \ldots$
satisfying the weak law but not the strong one.²


Let us state two applications of Proposition 7.1.29, one taken from Numerical Mathem- 
atics, the other from Number Theory. 


Example 7.1.35 (Monte Carlo method for integrals). Suppose we are given a quite
“complicated” function $f : [0,1]^n \to \mathbb{R}$. The task is to find the numerical value of

\[ \int_{[0,1]^n} f(x)\, dx = \int_0^1 \cdots \int_0^1 f(x_1, \ldots, x_n)\, dx_n \cdots dx_1 . \]

For large $n$ this can be a highly nontrivial problem. One way to overcome this difficulty
is to use a probabilistic approach that is based on the strong law of large numbers.

To this end, choose an independent sequence $\vec{U}_1, \vec{U}_2, \ldots$ of random vectors
uniformly distributed on $[0,1]^n$. For example, such a sequence can be constructed
as follows. Take independent $U_1, U_2, \ldots$ uniformly distributed on³ $[0,1]$ and build
random vectors by $\vec{U}_1 = (U_1, \ldots, U_n)$, $\vec{U}_2 = (U_{n+1}, \ldots, U_{2n})$, and so on.


Proposition 7.1.36. As above, let $\vec{U}_1, \vec{U}_2, \ldots$ be independent random vectors uniformly
distributed on $[0,1]^n$. Given an integrable function $f : [0,1]^n \to \mathbb{R}$, then, with
probability one,

\[ \lim_{N\to\infty} \frac{1}{N} \sum_{j=1}^{N} f(\vec{U}_j) = \int_{[0,1]^n} f(x)\, dx . \]


Proof: Set $X_j := f(\vec{U}_j)$, $j = 1, 2, \ldots$. By construction, the $X_j$s are independent and
identically distributed random variables. Proposition 3.6.18 implies (compare also
Example 3.6.21) that the distribution densities of the random vectors $\vec{U}_j$ are given by

\[ p(x) = \begin{cases} 1 & : \ x \in [0,1]^n \\ 0 & : \ \text{otherwise.} \end{cases} \]

As already mentioned in Remark 5.3.5, formula (5.38), stated for a function of two
variables, also holds for functions of $n$ variables, $n \ge 1$ arbitrary. This implies

\[ \mathbb{E}X_1 = \mathbb{E}f(\vec{U}_1) = \int_{\mathbb{R}^n} f(x)\, p(x)\, dx = \int_{[0,1]^n} f(x)\, dx . \]

Thus, Proposition 7.1.29 applies and leads to

\[ \mathbb{P}\Big\{\lim_{N\to\infty} \frac{1}{N} \sum_{j=1}^{N} f(\vec{U}_j) = \int_{[0,1]^n} f(x)\, dx\Big\} = \mathbb{P}\Big\{\lim_{N\to\infty} \frac{1}{N} \sum_{j=1}^{N} X_j = \mathbb{E}X_1\Big\} = 1 \]

as asserted. □

² In the case of nonidentically distributed $X_j$s one investigates if $\frac{1}{n} \sum_{j=1}^{n} (X_j - \mathbb{E}X_j)$ converges to zero
either in probability (weak law) or almost surely (strong law).
³ Use the methods developed in Section 4.4 to construct such $U_j$s.


Remark 7.1.37. The numerical application of the preceding proposition is as follows.
Choose independent numbers $u_i^{(j)}$, $1 \le i \le n$, $1 \le j \le N$, uniformly distributed on $[0,1]$
and set

\[ R_N(f) := \frac{1}{N} \sum_{j=1}^{N} f\big(u_1^{(j)}, \ldots, u_n^{(j)}\big) . \]

Proposition 7.1.36 asserts that $R_N(f)$ converges almost surely to $\int_{[0,1]^n} f(x)\, dx$. Thus, if
$N \ge 1$ is large, then $R_N(f)$ may be taken as approximate value for $\int_{[0,1]^n} f(x)\, dx$.

If we apply Proposition 7.1.36 to the indicator function of a Borel set $B \subseteq [0,1]^n$,
that is, we choose $f = 1_B$ with $1_B$ as in Definition 3.6.14, then with probability 1 it
follows that

\[ \mathrm{vol}_n(B) = \int_{[0,1]^n} 1_B(x)\, dx = \lim_{N\to\infty} \frac{1}{N} \sum_{j=1}^{N} 1_B(\vec{U}_j) = \lim_{N\to\infty} \frac{\#\{j \le N : \vec{U}_j \in B\}}{N} . \]

This provides us with a method to determine the volume $\mathrm{vol}_n(B)$, even for quite
“complicated” Borel sets $B \subseteq \mathbb{R}^n$.
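
The procedure of Remark 7.1.37 can be sketched in a few lines of Python. The code is an added illustration; the function names, the seed, and the two test cases are assumptions chosen because their exact values are known: $\int_{[0,1]^2} x_1 x_2\, dx = 1/4$, and the quarter disc $B = \{x \in [0,1]^2 : x_1^2 + x_2^2 \le 1\}$ has $\mathrm{vol}_2(B) = \pi/4$. The pseudo-random numbers of the standard library stand in for the independent uniformly distributed numbers.

```python
import math
import random


def monte_carlo_integral(f, dim, n_samples, seed=0):
    """R_N(f): average of f over n_samples uniform points in [0,1]^dim."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        point = [rng.random() for _ in range(dim)]
        total += f(point)
    return total / n_samples


def in_quarter_disc(x):
    """Indicator function of B = {x in [0,1]^2 : x_1^2 + x_2^2 <= 1}."""
    return 1.0 if x[0] ** 2 + x[1] ** 2 <= 1.0 else 0.0


if __name__ == "__main__":
    n_samples = 100000
    print("integral of x1*x2:", monte_carlo_integral(lambda x: x[0] * x[1], 2, n_samples))
    print("vol_2(B):         ", monte_carlo_integral(in_quarter_disc, 2, n_samples))
    print("pi/4:             ", math.pi / 4)
```

The error of $R_N(f)$ is random; by the variance computation in the proof of Proposition 7.1.27 its typical size is of order $1/\sqrt{N}$, independently of the dimension $n$.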


Example 7.1.38 (Normal numbers). As we saw in Section 4.3.1, each $x \in [0, 1)$ admits a
representation as binary fraction $x = 0.x_1x_2\cdots$ with $x_j \in \{0, 1\}$. Take some fixed $x \in [0, 1)$
with binary representation $x = 0.x_1x_2\cdots$. Then one may ask whether in the binary
representation of $x$ one of the numbers 0 or 1 occurs more frequently than the other
one. Or do both numbers possess the same frequency, at least on average?

To investigate this question, for $n \in \mathbb{N}$ set

\[ a_n^0(x) := \#\{k \le n : x_k = 0\} \quad\text{and}\quad a_n^1(x) := \#\{k \le n : x_k = 1\}, \quad x = 0.x_1x_2\cdots . \]

Thus, $a_n^0(x)$ is the frequency of the number 0 among the first $n$ positions in the
representation of $x$.

Definition 7.1.39. A number $x \in [0,1)$ is said to be normal (with respect to
base 2) if

\[ \lim_{n\to\infty} \frac{a_n^0(x)}{n} = \lim_{n\to\infty} \frac{a_n^1(x)}{n} = \frac{1}{2} . \]

In other words, a number $x \in [0, 1)$ is normal with respect to base 2 if, on average, in
its binary representation the frequency of 0, and hence also of 1, equals 1/2. Are there
many normal numbers as, for example, $x = 0.0101010\cdots$, or maybe only a few?
The next proposition gives the answer.


Proposition 7.1.40. Let $\mathbb{P}$ be the uniform distribution on $[0,1]$. Then there is a subset
$M \subseteq [0, 1)$ with $\mathbb{P}(M) = 1$ such that all $x \in M$ are normal with respect to base 2.


Proof: Define random variables $X_k : [0,1) \to \mathbb{R}$, $k = 1, 2, \ldots$, by $X_k(x) := x_k$ whenever
$x = 0.x_1x_2\cdots$. Proposition 4.3.3 tells us that the $X_k$s are independent with $\mathbb{P}\{X_k = 0\} = 1/2$
and $\mathbb{P}\{X_k = 1\} = 1/2$. Recall that the underlying probability measure $\mathbb{P}$ on $[0, 1]$ is
the uniform distribution. By the definition of the $X_k$s it follows that

\[ S_n(x) := X_1(x) + \cdots + X_n(x) = \#\{k \le n : X_k(x) = 1\} = a_n^1(x) . \]

Since $\mathbb{E}X_1 = 1/2$, Proposition 7.1.29 implies the existence of a subset $M \subseteq [0,1)$ with
$\mathbb{P}(M) = 1$ such that for $x \in M$ it follows that

\[ \lim_{n\to\infty} \frac{a_n^1(x)}{n} = \lim_{n\to\infty} \frac{S_n(x)}{n} = \mathbb{E}X_1 = \frac{1}{2} . \]

Since $a_n^0(x) = n - a_n^1(x)$, this completes the proof. □
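
For rational numbers the digit frequencies $a_n^1(x)/n$ can be computed exactly. The sketch below is an added illustration with assumed function names; it uses exact rational arithmetic to generate the binary digits by repeated doubling. For $x = 1/3 = 0.010101\cdots$ the frequency of the digit 1 equals $1/2$ for every even $n$, whereas $x = 1/4 = 0.01000\cdots$ is not normal because its frequency of ones tends to zero.

```python
from fractions import Fraction


def binary_digits(x, n):
    """First n binary digits x_1, ..., x_n of a rational x in [0, 1)."""
    digits = []
    for _ in range(n):
        x *= 2
        if x >= 1:
            digits.append(1)
            x -= 1
        else:
            digits.append(0)
    return digits


def frequency_of_ones(x, n):
    """a_n^1(x) / n, the relative frequency of the digit 1 among x_1..x_n."""
    return sum(binary_digits(x, n)) / n


if __name__ == "__main__":
    print(binary_digits(Fraction(1, 3), 10))
    for x in (Fraction(1, 3), Fraction(1, 4), Fraction(5, 7)):
        print(x, frequency_of_ones(x, 1000))
```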


Remark 7.1.41. The previous considerations do not depend on the fact that the base of
the representation was 2. They extend easily to representations with respect to any base
$b \ge 2$. Here, the definition of normal numbers has to be extended slightly. Fix $b \ge 2$.
Each $x \in [0, 1)$ admits the representation $x = 0.x_1x_2\cdots$ where $x_j \in \{0, \ldots, b-1\}$ provided
that $x = \sum_{k=1}^{\infty} x_k / b^k$. To make this representation unique we do not allow representations
$x = 0.x_1x_2\cdots$ where for some $k_0 \in \mathbb{N}$ we have $x_k = b - 1$ whenever $k \ge k_0$.

Then a number $x$ is said to be normal with respect to the base $b \ge 2$ if for all
$\ell = 0, \ldots, b-1$

\[ \lim_{n\to\infty} \frac{\#\{j \le n : x_j = \ell\}}{n} = \frac{1}{b}, \quad x = 0.x_1x_2\cdots . \]

Similar methods as used in the proof of Proposition 7.1.40 show that there is a set
$M_b \subseteq [0,1)$ with $\mathbb{P}(M_b) = 1$ such that all $x \in M_b$ are normal with respect to base $b$.
Letting $M = \bigcap_{b=2}^{\infty} M_b$, then property (5) (Boole’s inequality) in Proposition 1.2.1 easily
gives $\mathbb{P}(M) = 1$. Numbers $x \in M$ are completely normal, which says that they are
normal for any base $b \ge 2$. Again we see that with respect to the uniform distribution
on $[0, 1]$ almost all numbers are completely normal.


7.2 Central Limit Theorem 


Why does the normal distribution play such an important role in Probability Theory
and why are so many observed random phenomena normally distributed? The
reason for this is the central limit theorem, which we are going to present in this
section.

Regard a sequence of independent and identically distributed random variables
$(X_j)_{j \ge 1}$ with finite second moment. As in eq. (7.8) let $S_n$ be the sum of $X_1, \ldots, X_n$. For
example, if $X_j$ is the loss or gain in the $j$th game, then $S_n$ is the total loss or gain after
$n$ games. Which probability distribution does $S_n$ possess? Theoretically, this can be
evaluated by the convolution formulas stated in Section 4.5. But practically, this is
mostly impossible; imagine we want to determine the distribution of the sum of 100
rolls with a fair die. Therefore, one is very interested in asymptotic statements about
the distribution of $S_n$.

To get a clue about possible asymptotic distributions of $S_n$, take independent $B_{1,p}$-distributed
$X_j$s. In this case, the distribution of $S_n$ is known to be $B_{n,p}$.

For example, if $p = 0.4$ and $n = 30$, then $\mathbb{P}\{S_n = k\} = B_{n,p}(\{k\})$, $k = 0, \ldots, 30$, may
be described as in Fig. 7.1.

The summit of the diagram occurs at $k = 12$, which is the expected value of $S_{30}$.
Enlarging the number of trials leads to a shift of the summit to the right. At the same
time, the height of the summit becomes smaller.
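
The location and the height of the summit can be computed directly from the binomial probabilities. The following sketch is an added illustration; the helper names `binomial_pmf` and `summit` are assumptions. It lists $B_{n,p}(\{k\})$ for $p = 0.4$ and reports the position and the height of the largest value for several $n$.

```python
import math


def binomial_pmf(n, p):
    """List of B_{n,p}({k}) for k = 0, ..., n."""
    return [math.comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(n + 1)]


def summit(pmf):
    """Index and value of the largest probability in the list."""
    k_max = max(range(len(pmf)), key=pmf.__getitem__)
    return k_max, pmf[k_max]


if __name__ == "__main__":
    for n in (30, 60, 120):
        k_max, height = summit(binomial_pmf(n, 0.4))
        print("n =", n, "summit at k =", k_max, "height =", round(height, 4))
```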

The shape of the diagram in Figure 7.1 suggests that sums of independent,
identically distributed random variables are “almost” normally distributed. If
this is so, which expected value and which variance would the approximating normal
distribution possess?


Figure 7.1: Probability mass function of $B_{n,p}$, $n = 30$ and $p = 0.4$.


Let us investigate this question in the general setting. Thus, we are given a
sequence $(X_j)_{j \ge 1}$ of independent identically distributed random variables with finite
second moment and with $\mu = \mathbb{E}X_1$ and $\sigma^2 = \mathbb{V}X_1 > 0$. If, as before,
$S_n = X_1 + \cdots + X_n$, then

\[ \mathbb{E}S_n = n\mu \quad\text{and}\quad \mathbb{V}S_n = n\sigma^2 . \]

Consequently, if we conjecture that $S_n$ is “approximately” normally distributed, then
the normalized sum $(S_n - n\mu)/(\sigma\sqrt{n})$ should be “approximately” $\mathcal{N}(0, 1)$-distributed.
Recall that Propositions 5.1.36 and 5.2.15 imply

\[ \mathbb{E}\Big(\frac{S_n - n\mu}{\sigma\sqrt{n}}\Big) = 0 \quad\text{and}\quad \mathbb{V}\Big(\frac{S_n - n\mu}{\sigma\sqrt{n}}\Big) = 1 . \]

The question about the possible limit of the normalized sums $(S_n - n\mu)/(\sigma\sqrt{n})$ remained
open for a long time. In 1718 Abraham de Moivre investigated the limit behavior for a
special case of binomial distributed random variables. As limit he found some infinite
series, not a concrete function. In 1808 the American scientist and mathematician
Robert Adrain published a paper where for the first time the normal distribution occurred.
A year later, independently of the former work, Carl Friedrich Gauß used the
normal distribution for error estimates. In 1812 Pierre-Simon Laplace proved that the
normalized sums of independent binomial distributed random variables approximate
the normal distribution. Later on, Andrei Andreyevich Markov, Aleksandr Mikhailovich
Lyapunov, Jarl Waldemar Lindeberg, Paul Lévy, and other mathematicians
continued the work of De Moivre and Laplace. In particular, they showed that the normal
distribution always occurs as a limit, not only for binomial distributed random
variables. The only assumption is that the random variables possess a finite second
moment. We refer to the very interesting book [Fis11] for further reading about the
history of normal approximation.

The question remains in which sense $(S_n - n\mu)/(\sigma\sqrt{n})$ converges to the standard
normal distribution. To answer this we have to introduce the concept of
convergence in distribution.


Definition 7.2.1. Let $Y_1, Y_2, \ldots$ and $Y$ be random variables with distribution
functions⁴ $F_1, F_2, \ldots$ and $F$. The sequence $(Y_n)_{n \ge 1}$ converges to $Y$ in distribution
provided that

\[ \lim_{n\to\infty} F_n(t) = F(t) \quad\text{for all } t \in \mathbb{R} \text{ at which } F \text{ is continuous.} \tag{7.9} \]

In this case one writes $Y_n \xrightarrow{D} Y$.

⁴ $F_n(t) = \mathbb{P}\{Y_n \le t\}$, $t \in \mathbb{R}$.


Remark 7.2.2. An alternative way to formulate property (7.9) is as follows:

\[ \lim_{n\to\infty} \mathbb{P}\{Y_n \le t\} = \mathbb{P}\{Y \le t\} \quad\text{for all } t \in \mathbb{R} \text{ with } \mathbb{P}\{Y = t\} = 0 . \]

Without proof we state two other characterizations of convergence in distribution.


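Proposition 7.2.3. One has $Y_n \xrightarrow{D} Y$ if and only if for all bounded continuous functions
$f : \mathbb{R} \to \mathbb{R}$

\[ \lim_{n\to\infty} \mathbb{E}f(Y_n) = \mathbb{E}f(Y) . \]

Furthermore, this is also equivalent to

\[ \limsup_{n\to\infty} \mathbb{P}\{Y_n \in A\} \le \mathbb{P}\{Y \in A\} \]

for all closed subsets $A \subseteq \mathbb{R}$.


If the distribution function of $Y$ is continuous, that is, we have $\mathbb{P}\{Y = t\} = 0$ for all
$t \in \mathbb{R}$, then $Y_n \xrightarrow{D} Y$ is equivalent to $\lim_{n\to\infty} F_n(t) = F(t)$ for all $t \in \mathbb{R}$. Besides, in this
case, the type of convergence is stronger, as the next proposition shows.


Proposition 7.2.4. Let $Y_1, Y_2, \ldots$ and $Y$ be random variables with $\mathbb{P}\{Y = t\} = 0$ for all
$t \in \mathbb{R}$. Then $Y_n \xrightarrow{D} Y$ implies that the distribution functions converge uniformly, that is,

\[ \lim_{n\to\infty} \sup_{t \in \mathbb{R}} \big|\mathbb{P}\{Y_n \le t\} - \mathbb{P}\{Y \le t\}\big| = 0, \quad\text{hence also} \]

\[ \lim_{n\to\infty} \sup_{a < b} \big|\mathbb{P}\{a \le Y_n \le b\} - \mathbb{P}\{a \le Y \le b\}\big| = 0 . \]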


We now have all the notation and definitions necessary to formulate the central
limit theorem. Usually, this theorem is proved via properties of so-called characteristic
functions (see Chapter 3 of [Dur10] for such a proof). For alternative proofs using
properties of moment generating functions we refer to [Ros14] and [Gha05].


Proposition 7.2.5 (Central limit theorem). Let $(X_j)_{j\ge 1}$ be a sequence of independent
identically distributed random variables with finite second moment. Let $\mu$ be the
expected value of the $X_j$s and let $\sigma^2 > 0$ be their variance. Then for the sums
$S_n = X_1 + \cdots + X_n$ it follows that

$$\frac{S_n - n\mu}{\sigma\sqrt{n}} \xrightarrow{D} Z. \tag{7.10}$$

Here $Z$ is an $\mathcal{N}(0, 1)$-distributed random variable.


Since the limit Z in statement (7.10) is a continuous random variable, Proposition 7.2.4 
applies, and the central limit theorem may also be formulated as follows. 


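Proposition 7.2.6. Suppose $(X_j)_{j\ge 1}$ and $S_n$ are as in Proposition 7.2.5. Then it
follows that

$$\lim_{n\to\infty} \sup_{t\in\mathbb{R}} \left| P\left\{ \frac{S_n - n\mu}{\sigma\sqrt{n}} \le t \right\} - \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{t} e^{-x^2/2}\,dx \right| = 0 \quad \text{and} \tag{7.11}$$

$$\lim_{n\to\infty} \sup_{a<b} \left| P\left\{ a \le \frac{S_n - n\mu}{\sigma\sqrt{n}} \le b \right\} - \frac{1}{\sqrt{2\pi}} \int_{a}^{b} e^{-x^2/2}\,dx \right| = 0. \tag{7.12}$$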


Remark 7.2.7. Recall that $\Phi$ denotes the distribution function of the standard normal
distribution as introduced in eq. (1.62). Thus, another way to write eq. (7.11) is as
follows:

$$\lim_{n\to\infty} \sup_{t\in\mathbb{R}} \left| P\left\{ \frac{S_n - n\mu}{\sigma\sqrt{n}} \le t \right\} - \Phi(t) \right| = 0.$$


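Our next objective is another reformulation of eq. (7.12). If we set $a' = a\sigma\sqrt{n} + n\mu$
and $b' = b\sigma\sqrt{n} + n\mu$, then these numbers depend on $n \in \mathbb{N}$. But since the
convergence in eq. (7.12) is uniform, we may replace $a$ and $b$ by $a'$ and $b'$, respectively,
and obtain

$$\lim_{n\to\infty} \sup_{a'<b'} \left| P\{a' \le S_n \le b'\} - P\left\{ \frac{a' - n\mu}{\sigma\sqrt{n}} \le Z \le \frac{b' - n\mu}{\sigma\sqrt{n}} \right\} \right| = 0. \tag{7.13}$$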


Here, as before, $Z$ denotes a standard normally distributed random variable. For a final
reformulation set

$$Z_n := \sigma\sqrt{n}\,Z + n\mu.$$

Then eq. (7.13) is equivalent to

$$\lim_{n\to\infty} \sup_{a'<b'} \left| P\{a' \le S_n \le b'\} - P\{a' \le Z_n \le b'\} \right| = 0. \tag{7.14}$$


By Proposition 4.2.3, the random variables $Z_n$ are $\mathcal{N}(n\mu, n\sigma^2)$-distributed,
which allows us to interpret eq. (7.13), or eq. (7.14), as follows. If $\mu = \mathbb{E}X_1$ and
$\sigma^2 = \mathbb{V}X_1$, then for large $n$, the sum $S_n$ is “approximately”
$\mathcal{N}(n\mu, n\sigma^2)$-distributed.
In other words, for $-\infty < a < b < \infty$ it follows that

$$P\{a \le S_n \le b\} \approx \Phi\left( \frac{b - n\mu}{\sigma\sqrt{n}} \right) - \Phi\left( \frac{a - n\mu}{\sigma\sqrt{n}} \right).$$
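The normal approximation of $P\{a \le S_n \le b\}$ can be tried out numerically. The following is a minimal sketch, assuming independent exponentially distributed summands with $\mu = \sigma = 1$; this choice, the function names, and the simulation parameters are illustrative assumptions and do not come from the text.

```python
import random
from math import sqrt
from statistics import NormalDist


def normal_approx(a, b, n, mu, sigma):
    """Phi((b - n*mu)/(sigma*sqrt(n))) - Phi((a - n*mu)/(sigma*sqrt(n)))."""
    phi = NormalDist().cdf
    scale = sigma * sqrt(n)
    return phi((b - n * mu) / scale) - phi((a - n * mu) / scale)


def simulated_frequency(a, b, n, trials=10_000, seed=1):
    """Relative frequency of {a <= S_n <= b} for sums of n Exp(1) variables."""
    rng = random.Random(seed)  # fixed seed so the run is reproducible
    hits = 0
    for _ in range(trials):
        s = sum(rng.expovariate(1.0) for _ in range(n))
        if a <= s <= b:
            hits += 1
    return hits / trials


if __name__ == "__main__":
    # n = 50 summands with mu = sigma = 1, interval [40, 60]
    print(normal_approx(40, 60, 50, 1.0, 1.0))
    print(simulated_frequency(40, 60, 50))
```

The simulated frequency should lie close to the normal approximation; the remaining difference is a mixture of simulation noise and the approximation error for finite $n$.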


Interpretation: We emphasize once more that the central limit theorem is valid for
all sequences of independent identically distributed random variables possessing a
second moment. For example, it is true for $X_j$s that are binomially distributed, for
$X_j$s that are exponentially distributed, and so on. Thus, no matter how the random
variables with second moment are distributed, all their normalized sums possess the
same limit, the normal distribution. This explains the outstanding role of the normal
distribution.

The deeper reason for this phenomenon is that $S_n$ may be viewed as the superposition
of many “small” independent errors or perturbations, all of the same kind.⁵ Although
each perturbation is distributed according to $P_{X_1}$, the independent superposition of
the perturbations leads to the fact that the final result is approximately normally
distributed. This explains why so many random phenomena may be described by normally
distributed random variables.


Remark 7.2.8 (Continuity correction). A slight technical problem arises in the case of
discrete random variables $X_j$. Then the $S_n$s are discrete as well, hence their
distribution functions $F_n$ have jumps. If these discontinuous $F_n$s approximate the
continuous function $\Phi$, then certain errors occur at the points where the $F_n$ jump.
To understand the problem, assume that the $X_j$s take values in $\mathbb{Z}$. Then $S_n$ is also
$\mathbb{Z}$-valued, hence for any $0 \le h < 1$ and all integers $k < l$ it follows that

$$P\{k \le S_n \le l\} = P\{k - h \le S_n \le l + h\}.$$

Consequently, for each such number $h$, the value

$$\Phi\left( \frac{l + h - n\mu}{\sigma\sqrt{n}} \right) - \Phi\left( \frac{k - h - n\mu}{\sigma\sqrt{n}} \right)$$

may be taken as normal approximation of the above probability. Which number $h < 1$
should be chosen?

To answer this question observe the following. If $k \le m < l$, then

$$P\{k \le S_n \le l\} = P\{k \le S_n \le m\} + P\{m + 1 \le S_n \le l\},$$


which is approximated by

$$\left[ \Phi\left( \frac{l + h - n\mu}{\sigma\sqrt{n}} \right) - \Phi\left( \frac{m + 1 - h - n\mu}{\sigma\sqrt{n}} \right) \right] + \left[ \Phi\left( \frac{m + h - n\mu}{\sigma\sqrt{n}} \right) - \Phi\left( \frac{k - h - n\mu}{\sigma\sqrt{n}} \right) \right].$$

Thus, in order to get neither an overlap nor a gap between $m + 1 - h - n\mu$ and $m + h - n\mu$,
it is customary to choose $h = 0.5$. This leads to the following definition.

5 The central limit theorem also holds for not necessarily identically distributed random
variables provided that all “errors” become uniformly small. That is, one has to exclude
that certain errors dominate the other ones.


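Definition 7.2.9. Suppose $X_1, X_2, \ldots$ are independent identically distributed with
values in $\mathbb{Z}$. Then the corrected normal approximation is given by

$$P\{k \le S_n \le l\} \approx \Phi\left( \frac{l + 0.5 - n\mu}{\sigma\sqrt{n}} \right) - \Phi\left( \frac{k - 0.5 - n\mu}{\sigma\sqrt{n}} \right).$$

It is called continuity correction or histogram correction for the normal
approximation. In a similar way, one corrects the approximation for infinite
intervals by

$$P\{S_n \le l\} \approx \Phi\left( \frac{l + 0.5 - n\mu}{\sigma\sqrt{n}} \right)$$

and by

$$P\{S_n \ge k\} \approx 1 - \Phi\left( \frac{k - 0.5 - n\mu}{\sigma\sqrt{n}} \right) = \Phi\left( \frac{n\mu - k + 0.5}{\sigma\sqrt{n}} \right). \tag{7.15}$$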
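The effect of the continuity correction can be checked against an exact binomial probability. The sketch below is only an illustration: the helper names and the example (20 tosses of a fair coin, so $\mu = 1/2$ and $\sigma = 1/2$) are assumptions chosen for this purpose.

```python
from math import comb, sqrt
from statistics import NormalDist

PHI = NormalDist().cdf  # distribution function of N(0, 1)


def plain_approx(k, l, n, mu, sigma):
    """Normal approximation of P{k <= S_n <= l} without correction."""
    s = sigma * sqrt(n)
    return PHI((l - n * mu) / s) - PHI((k - n * mu) / s)


def corrected_approx(k, l, n, mu, sigma, h=0.5):
    """Normal approximation of P{k <= S_n <= l} with continuity correction h."""
    s = sigma * sqrt(n)
    return PHI((l + h - n * mu) / s) - PHI((k - h - n * mu) / s)


def exact_binomial(k, l, n, p):
    """Exact P{k <= S_n <= l} for a B_{n,p}-distributed S_n."""
    return sum(comb(n, j) * p**j * (1 - p) ** (n - j) for j in range(k, l + 1))


if __name__ == "__main__":
    n, p = 20, 0.5
    mu, sigma = p, sqrt(p * (1 - p))
    print("exact    :", exact_binomial(8, 12, n, p))
    print("plain    :", plain_approx(8, 12, n, mu, sigma))
    print("corrected:", corrected_approx(8, 12, n, mu, sigma))
```

For this small $n$ the corrected value is much closer to the exact probability than the uncorrected one.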


The next result tells us that the continuity correction is only needed for small ns. 


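Proposition 7.2.10. For all $x \in \mathbb{R}$ and $h \in \mathbb{R}$ it follows that

$$\left| \Phi\left( \frac{x + h - n\mu}{\sigma\sqrt{n}} \right) - \Phi\left( \frac{x - n\mu}{\sigma\sqrt{n}} \right) \right| \le \frac{|h|}{\sigma\sqrt{2\pi n}}.$$

Proof: The mean value theorem of Calculus implies the existence of an intermediate
value $\xi$ between $\frac{x - n\mu}{\sigma\sqrt{n}}$ and $\frac{x + h - n\mu}{\sigma\sqrt{n}}$ such that

$$\left| \Phi\left( \frac{x + h - n\mu}{\sigma\sqrt{n}} \right) - \Phi\left( \frac{x - n\mu}{\sigma\sqrt{n}} \right) \right| = \Phi'(\xi)\, \frac{|h|}{\sigma\sqrt{n}}.$$

Using

$$\Phi'(\xi) = \frac{1}{\sqrt{2\pi}}\, e^{-\xi^2/2} \le \frac{1}{\sqrt{2\pi}},$$

this proves the asserted estimate. ∎

Remark 7.2.11. An application of Proposition 7.2.10 with $x = k$ or $x = l$, and with
$h = \pm 0.5$, shows that the improvement by the continuity correction is at most of order
$n^{-1/2}$. Thus, it is no longer needed for large $n$.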
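The estimate $|h| / (\sigma\sqrt{2\pi n})$ can be compared with the actual change in $\Phi$ for growing $n$. This is a small sketch with illustrative function names; the parameters of a fair die ($\mu = 7/2$, $\sigma^2 = 35/12$) are used only as an example.

```python
from math import pi, sqrt
from statistics import NormalDist

PHI = NormalDist().cdf


def correction_gap(x, h, n, mu, sigma):
    """|Phi((x + h - n*mu)/(sigma*sqrt(n))) - Phi((x - n*mu)/(sigma*sqrt(n)))|."""
    s = sigma * sqrt(n)
    return abs(PHI((x + h - n * mu) / s) - PHI((x - n * mu) / s))


def gap_bound(h, n, sigma):
    """Upper bound |h| / (sigma * sqrt(2*pi*n)) for the gap above."""
    return abs(h) / (sigma * sqrt(2 * pi * n))


if __name__ == "__main__":
    mu, sigma = 3.5, sqrt(35 / 12)
    for n in (10, 100, 1000, 10000):
        x = n * mu  # the gap is largest near the center of the distribution
        print(n, correction_gap(x, 0.5, n, mu, sigma), gap_bound(0.5, n, sigma))
```

Both columns shrink like $n^{-1/2}$, which illustrates why the correction matters less and less as $n$ grows.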


Example 7.2.12. Roll a fair die $n$ times. Let $S_n$ be the sum of the $n$ rolls. In view
of eq. (7.14), this sum $S_n$ is approximately $\mathcal{N}\left(\frac{7n}{2}, \frac{35n}{12}\right)$-distributed.
In other words, it follows that

$$\lim_{n\to\infty} P\left\{ a \le \frac{S_n - 7n/2}{\sqrt{35n/12}} \le b \right\} = \frac{1}{\sqrt{2\pi}} \int_a^b e^{-x^2/2}\,dx = \Phi(b) - \Phi(a).$$

Moreover, this convergence takes place uniformly for all $a < b$. Therefore, at least for
large $n$, the right-hand side of the last equation may be taken as an approximate value
of the left-hand one.

First, we consider an example with a small number of trials. We roll a die three
times and ask for the probability of the event $\{7 \le S_3 \le 8\}$. Let us compare the exact
value

$$P\{7 \le S_3 \le 8\} = \frac{1}{6} \approx 0.16667$$

with the one we get by applying the central limit theorem. Without continuity
correction, the approximate value is

$$\Phi\left( \frac{8 - 21/2}{\sqrt{3 \cdot 35/12}} \right) - \Phi\left( \frac{7 - 21/2}{\sqrt{3 \cdot 35/12}} \right) \approx 0.08065,$$

while an application of the continuity correction leads to

$$\Phi\left( \frac{8 + 0.5 - 21/2}{\sqrt{3 \cdot 35/12}} \right) - \Phi\left( \frac{7 - 0.5 - 21/2}{\sqrt{3 \cdot 35/12}} \right) \approx 0.1613.$$

The continuity correction brings the approximation much closer to the exact value.
Next we treat an example with large $n$. Let us investigate once more Example 7.1.6,
but this time from the point of view of the central limit theorem. Choose again $n = 10^3$,

$$a = -\frac{100\sqrt{12}}{\sqrt{35000}} \quad \text{and} \quad b = \frac{100\sqrt{12}}{\sqrt{35000}}.$$

Then it follows that

$$P\{3400 \le S_n \le 3600\} \approx \frac{1}{\sqrt{2\pi}} \int_a^b e^{-x^2/2}\,dx \approx 0.93592.$$

As we see, the use of the central limit theorem considerably improves the bound 0.709
obtained by Chebyshev’s inequality.
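The numbers for the die can be reproduced with a few lines of code. This is a sketch under the assumptions of the example ($\mu = 7/2$, $\sigma^2 = 35/12$); the function names are illustrative, and the exact value is obtained by brute-force enumeration, which is only feasible for small $n$.

```python
from itertools import product
from math import sqrt
from statistics import NormalDist

PHI = NormalDist().cdf
MU, VAR = 3.5, 35 / 12  # mean and variance of a single roll of a fair die


def exact_die_sum(k, l, n):
    """Exact P{k <= S_n <= l} by enumerating all 6**n outcomes (small n only)."""
    hits = sum(1 for rolls in product(range(1, 7), repeat=n) if k <= sum(rolls) <= l)
    return hits / 6**n


def approx_die_sum(k, l, n, h=0.0):
    """Normal approximation of P{k <= S_n <= l}; h = 0.5 applies the correction."""
    s = sqrt(n * VAR)
    return PHI((l + h - n * MU) / s) - PHI((k - h - n * MU) / s)


if __name__ == "__main__":
    print(exact_die_sum(7, 8, 3))          # exact value 1/6
    print(approx_die_sum(7, 8, 3))         # without continuity correction
    print(approx_die_sum(7, 8, 3, h=0.5))  # with continuity correction
    print(approx_die_sum(3400, 3600, 1000))
```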


Example 7.2.13. We investigate once more the model of a random walk on $\mathbb{Z}$ as presented
in Example 4.1.7. What does the central limit theorem imply in this case? To
simplify the calculations let us assume $p = 1/2$, that is, jumps to the left and to the
right are equally likely. Thus, if $X_j = -1$ when the particle jumps to the left in step $j$
and $X_j = 1$ otherwise, then $P\{X_j = -1\} = P\{X_j = 1\} = 1/2$ and $S_n = X_1 + \cdots + X_n$
is the position of the particle after $n$ jumps. Moreover, by assumption the $X_j$s are
independent and, of course, identically distributed. Thus, the central limit theorem
applies with $\mu = \mathbb{E}X_1 = 0$ and with $\sigma^2 = \mathbb{V}X_1 = 1$. Consequently, $S_n$ is
approximately $\mathcal{N}(0, n)$-distributed. More precisely, we have

$$\lim_{n\to\infty} P\{a\sqrt{n} \le S_n \le b\sqrt{n}\} = \frac{1}{\sqrt{2\pi}} \int_a^b e^{-x^2/2}\,dx.$$

For instance, if $a = -2$ and $b = 2$, then it follows that

$$\lim_{n\to\infty} P\{-2\sqrt{n} \le S_n \le 2\sqrt{n}\} = \frac{1}{\sqrt{2\pi}} \int_{-2}^{2} e^{-x^2/2}\,dx \approx 0.9544997.$$

Keep in mind that the possible values of $S_n$ are between $-n$ and $n$. But in reality, with
probability greater than 0.95, the position of the particle will be in the much smaller
interval $[-2\sqrt{n}, 2\sqrt{n}]$.
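For the symmetric random walk the probability of staying within $[-c\sqrt{n}, c\sqrt{n}]$ can also be computed exactly, because $S_n = 2B - n$ with a $B_{n,1/2}$-distributed number $B$ of jumps to the right. The sketch below uses this representation; the function name is an illustrative assumption.

```python
from math import comb, sqrt


def walk_within(n, c=2.0):
    """Exact P{-c*sqrt(n) <= S_n <= c*sqrt(n)} for the symmetric walk on Z.

    Uses S_n = 2*B - n, where B counts the jumps to the right.
    """
    bound = c * sqrt(n)
    favourable = sum(comb(n, j) for j in range(n + 1) if abs(2 * j - n) <= bound)
    return favourable / 2**n


if __name__ == "__main__":
    for n in (100, 1000, 10000):
        print(n, walk_within(n))  # compare with the limit value 0.9544997
```

The exact values are somewhat larger than the limit for moderate $n$, since the discrete walk places mass on the endpoints of the interval.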


Example 7.2.14 (Round-off errors). Many calculations in a bank, for instance of interest,
lead to amounts that are not integral in cents. In this case the bank rounds the
calculated value either up or down, depending on whether the remainder is larger or
smaller than 0.5 cents. For example, if the calculations lead to $12.837, then the bank
transfers $12.84. Thus, in this case, the bank loses 0.3 cents. This seems to be a small
amount, but if, for example, the bank performs $10^6$ calculations per day, the total loss
or gain could sum up to an amount of $5000.00. But does this really happen?

Answer: Theoretically, the rounding procedure could lead to huge losses or gains
of the bank. But, as the central limit theorem shows, in reality such a scenario is
extremely unlikely. To make this more precise, we use the following model. Let $X_j$ be the
loss or gain (in cents) of the bank in calculation $j$. Then the $X_j$ are independent and
uniformly distributed on $[-0.5, 0.5]$. Thus, the total loss or gain after $n$ calculations
equals $S_n = X_1 + \cdots + X_n$. By Propositions 5.1.25 and 5.2.23 we know that

$$\mu = \mathbb{E}X_1 = 0 \quad \text{and} \quad \sigma^2 = \mathbb{V}X_1 = \frac{1}{12},$$

hence, if $a < b$, the central limit theorem implies

$$\lim_{n\to\infty} P\left\{ \frac{a\sqrt{n}}{\sqrt{12}} \le S_n \le \frac{b\sqrt{n}}{\sqrt{12}} \right\} = \frac{1}{\sqrt{2\pi}} \int_a^b e^{-x^2/2}\,dx.$$

For example, if $n = 10^6$, then taking $a = \sqrt{12}$ and $b = \infty$, this leads to

$$P\{S_n \ge \$10\} = P\{S_n \ge 10^3 \text{ cents}\} \approx \frac{1}{\sqrt{2\pi}} \int_{\sqrt{12}}^{\infty} e^{-x^2/2}\,dx \approx 0.00026603,$$

which is an extremely small probability. By symmetry, it also follows that

$$P\{S_n \le -\$10\} \approx \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{-\sqrt{12}} e^{-x^2/2}\,dx \approx 0.00026603.$$

In a similar way one obtains

$$P\{S_n \ge \$1\} \approx 0.364517, \quad P\{S_n \ge \$2\} \approx 0.244211,$$
$$P\{S_n \ge \$5\} \approx 0.0416323 \quad \text{and} \quad P\{S_n \ge \$20\} \approx 2.1311 \times 10^{-12}.$$

This shows that even for many calculations, in our case $10^6$ of them, a loss or gain of
more than $5 is very unlikely. Recall that theoretically an amount of $5000.00 would be
possible.
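The tail probabilities of the rounding model follow directly from the normal approximation with $\mu = 0$ and $\sigma^2 = 1/12$. A minimal sketch, with an illustrative function name; `math.erfc` is used instead of $1 - \Phi$ so that very small tails such as the one for $20 are not lost to rounding.

```python
from math import erfc, sqrt


def tail_prob_dollars(amount, n=10**6):
    """Normal approximation of P{S_n >= amount dollars}.

    S_n is the sum of n independent round-off errors (in cents), each
    uniform on [-0.5, 0.5], so S_n has mean 0 and variance n/12.
    """
    sd_cents = sqrt(n / 12)
    z = 100 * amount / sd_cents
    return 0.5 * erfc(z / sqrt(2))  # equals 1 - Phi(z), numerically stable


if __name__ == "__main__":
    for dollars in (1, 2, 5, 10, 20):
        print(dollars, tail_prob_dollars(dollars))
```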


Special Cases of the Central Limit Theorem: 

Binomially distributed random variables: In 1738 De Moivre, and later on in 1812 Laplace,
investigated the normal approximation of binomially distributed⁶ random variables.
This was the starting point for the investigation of general central limit theorems. Let
us state their result.


Proposition 7.2.15 (De Moivre-Laplace theorem). Let $X_j$ be independent $B_{1,p}$-distributed
random variables. Then their sums $S_n = X_1 + \cdots + X_n$ satisfy

$$\lim_{n\to\infty} P\left\{ a \le \frac{S_n - np}{\sqrt{np(1-p)}} \le b \right\} = \frac{1}{\sqrt{2\pi}} \int_a^b e^{-x^2/2}\,dx. \tag{7.16}$$

Proof: Recall that for a $B_{1,p}$-distributed random variable $X$ we have $\mu = \mathbb{E}X = p$ and
$\sigma^2 = \mathbb{V}X = p(1-p)$. Consequently, Proposition 7.2.6 applies and leads to eq. (7.16). ∎


Remark 7.2.16. By Corollary 4.6.2 we know that $S_n = X_1 + \cdots + X_n$ is $B_{n,p}$-distributed.
Consequently, eq. (7.16) may also be written as

$$\lim_{n\to\infty} \sum_{k \in I_{n,a,b}} \binom{n}{k} p^k (1-p)^{n-k} = \frac{1}{\sqrt{2\pi}} \int_a^b e^{-x^2/2}\,dx,$$

where

$$I_{n,a,b} := \left\{ k \ge 0 : a \le \frac{k - np}{\sqrt{np(1-p)}} \le b \right\}.$$

6 De Moivre investigated sums of $B_{1,1/2}$-distributed random variables while Laplace
treated $B_{1,p}$-distributed ones for general $0 < p < 1$.


Another way to formulate the De Moivre-Laplace theorem is as follows. For “large” $n$,
$S_n$ is approximately $\mathcal{N}(np, np(1-p))$-distributed. That is, if $0 \le l \le m \le n$, then

$$\sum_{k=l}^{m} \binom{n}{k} p^k (1-p)^{n-k} \approx \Phi\left( \frac{m - np}{\sqrt{np(1-p)}} \right) - \Phi\left( \frac{l - np}{\sqrt{np(1-p)}} \right). \tag{7.17}$$

Since the sums $S_n$ are integer-valued, the continuity correction should be applied for
small $n$, that is, on the right-hand side of eq. (7.17) the numbers $m$ and $l$ should be
replaced by $m + 0.5$ and $l - 0.5$, respectively.


Example 7.2.17. Play a series of games with success probability $0 < p < 1$. Let $\alpha \in (0,1)$
be a given security probability, and let $m \in \mathbb{N}$ be some integer. How many games does one
have to play in order to have, with probability greater than or equal to $1 - \alpha$, at
least $m$ successes?

Answer: Define random variables $X_j$ by setting $X_j = 1$ when winning game $j$, while
$X_j = 0$ in the case of losing it. Then the $X_j$s are independent and $B_{1,p}$-distributed.
Hence, if $S_n = X_1 + \cdots + X_n$, then the above question may be formulated as follows. What
is the smallest $n \in \mathbb{N}$ for which

$$P\{S_n \ge m\} \ge 1 - \alpha\,? \tag{7.18}$$

By Corollary 4.6.2, the sum $S_n$ is $B_{n,p}$-distributed and, therefore, estimate (7.18)
transforms to

$$\sum_{k=m}^{n} \binom{n}{k} p^k (1-p)^{n-k} \ge 1 - \alpha. \tag{7.19}$$

Thus, the “exact” answer to the above question is as follows. Choose the minimal $n \ge 1$
for which estimate (7.19) is valid.


Remark 7.2.18. For large $m$ it may be a difficult task to determine the minimal $n$
satisfying estimate (7.19). Therefore, one looks for an “approximative” approach via
Proposition 7.2.15. Rewriting estimate (7.18) as

$$P\left\{ \frac{S_n - np}{\sqrt{np(1-p)}} \ge \frac{m - np}{\sqrt{np(1-p)}} \right\} \ge 1 - \alpha,$$

an “approximative” condition for $n$ is

$$1 - \alpha \le 1 - \Phi\left( \frac{m - np}{\sqrt{np(1-p)}} \right) = \Phi\left( \frac{np - m}{\sqrt{np(1-p)}} \right).$$

Given $\beta \in (0,1)$, let us define⁷ $u_\beta$ by $\Phi(u_\beta) = \beta$. Consequently, the approximative
solution of the above question is to choose the minimal $n \ge 1$ satisfying

$$\frac{np - m}{\sqrt{np(1-p)}} \ge u_{1-\alpha}. \tag{7.20}$$

For “small” $n$ we have to modify the previous approach slightly. Here we have to use
the continuity correction. In view of eq. (7.15) the condition is now

$$1 - \alpha \le \Phi\left( \frac{np - m + 0.5}{\sqrt{np(1-p)}} \right),$$

leading to

$$\frac{np - m + 0.5}{\sqrt{np(1-p)}} \ge u_{1-\alpha}. \tag{7.21}$$

Let us explain Remark 7.2.18 with the help of a concrete example.


Example 7.2.19. Find the minimal $n \ge 1$ such that, rolling a fair die $n$ times, one
observes with probability greater than or equal to 0.9 at least 100 times the number 6.
The “exact” answer is, choose the minimal $n \ge 1$ satisfying

$$\sum_{k=100}^{n} \binom{n}{k} \left( \frac{1}{6} \right)^{k} \left( \frac{5}{6} \right)^{n-k} \ge 0.9.$$

Numerical calculations give that the left-hand side equals 0.897721 if $n = 670$, and it
is 0.900691 if $n = 671$. Thus, in order to observe, with probability greater than 0.9, the
number “6” at least 100 times, one has to roll the die at least 671 times.

Let us compare this result with the one that we obtain by the approximative
approach. First we approximate $S_n$ directly, that is, without applying the continuity
correction. Here estimate (7.20) says that we have to look for the minimal $n \ge 1$
satisfying

$$\frac{n - 600}{\sqrt{5n}} \ge u_{0.9} = 1.28155. \tag{7.22}$$

Since

$$\frac{674 - 600}{\sqrt{5 \cdot 674}} = 1.27473 \quad \text{and} \quad \frac{675 - 600}{\sqrt{5 \cdot 675}} = 1.29099,$$

the smallest $n$ satisfying estimate (7.22) is 675.

7 Later on, in Section 8.4.3, these numbers $u_\beta$ will play an important role; compare also
Definition 8.4.5.
Applying the continuity correction, by estimate (7.21), condition (7.22) has to be
replaced by

$$\frac{n - 600 + 3}{\sqrt{5n}} \ge u_{0.9} = 1.28155.$$

The left-hand side equals 1.27757 for $n = 671$ and 1.29387 if $n = 672$. Consequently, this
type of approximation gives the (more precise) value $n = 672$ for the minimal number
of necessary rolls of the die.
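The three answers for the die problem (exact, normal approximation, and normal approximation with continuity correction) can be computed as follows. This is a sketch with illustrative function names; it relies on the fact that the left-hand sides of (7.19), (7.20), and (7.21) increase with $n$, so the first $n$ that satisfies the condition is the minimal one.

```python
from math import comb, sqrt
from statistics import NormalDist


def prob_at_least(m, n, p):
    """Exact P{S_n >= m} for a B_{n,p}-distributed S_n (complementary sum)."""
    return 1.0 - sum(comb(n, k) * p**k * (1 - p) ** (n - k) for k in range(m))


def minimal_n_exact(m, p, alpha):
    """Smallest n with P{S_n >= m} >= 1 - alpha, cf. estimate (7.19)."""
    n = m
    while prob_at_least(m, n, p) < 1 - alpha:
        n += 1
    return n


def minimal_n_normal(m, p, alpha, h=0.0):
    """Smallest n with (n*p - m + h)/sqrt(n*p*(1-p)) >= u_{1-alpha}.

    h = 0 corresponds to (7.20), h = 0.5 to the corrected condition (7.21).
    """
    u = NormalDist().inv_cdf(1 - alpha)  # quantile u_{1-alpha}
    n = 1
    while (n * p - m + h) / sqrt(n * p * (1 - p)) < u:
        n += 1
    return n


if __name__ == "__main__":
    print(minimal_n_exact(100, 1 / 6, 0.1))
    print(minimal_n_normal(100, 1 / 6, 0.1))
    print(minimal_n_normal(100, 1 / 6, 0.1, h=0.5))
```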


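Poisson distributed random variables: Let $X_1, X_2, \ldots$ be independent and
$\mathrm{Pois}_\lambda$-distributed. By Propositions 5.1.16 and 5.2.20 we know

$$\mathbb{E}X_1 = \lambda \quad \text{and} \quad \mathbb{V}X_1 = \lambda.$$

Thus, in this case Proposition 7.2.6 reads as follows.

Proposition 7.2.20. Let $(X_j)_{j\ge 1}$ be independent $\mathrm{Pois}_\lambda$-distributed. Then the sums
$S_n = X_1 + \cdots + X_n$ satisfy

$$\lim_{n\to\infty} P\left\{ a \le \frac{S_n - n\lambda}{\sqrt{n\lambda}} \le b \right\} = \frac{1}{\sqrt{2\pi}} \int_a^b e^{-x^2/2}\,dx. \tag{7.23}$$

Remark 7.2.21. By Proposition 4.6.5, the sum $S_n$ is $\mathrm{Pois}_{\lambda n}$-distributed, hence
eq. (7.23) transforms to

$$\lim_{n\to\infty} \sum_{k \in I_{n,a,b}} \frac{(\lambda n)^k}{k!}\, e^{-\lambda n} = \frac{1}{\sqrt{2\pi}} \int_a^b e^{-x^2/2}\,dx, \tag{7.24}$$

where

$$I_{n,a,b} := \left\{ k \in \mathbb{N}_0 : a \le \frac{k - n\lambda}{\sqrt{n\lambda}} \le b \right\}.$$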

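In particular, for integers $0 \le l \le m$ and large $n$, the sum $\sum_{k=l}^{m} \frac{(\lambda n)^k}{k!} e^{-\lambda n}$ is
approximately $\Phi\left( \frac{m - n\lambda}{\sqrt{n\lambda}} \right) - \Phi\left( \frac{l - n\lambda}{\sqrt{n\lambda}} \right)$.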




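Remark 7.2.22. Choosing in eq. (7.24) the numbers as $a = -\infty$, $b = 0$ and $\lambda = 1$, we get

$$\lim_{n\to\infty} e^{-n} \sum_{k=0}^{n} \frac{n^k}{k!} = \frac{1}{2},$$

which is interesting in its own right. Taking $a = -\infty$ and $b_n = \sqrt{n}$ yields

$$\lim_{n\to\infty} \left[ e^{-n} \sum_{k=0}^{2n} \frac{n^k}{k!} - \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\sqrt{n}} e^{-x^2/2}\,dx \right] = 0,$$

hence, because of

$$\lim_{n\to\infty} \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{\sqrt{n}} e^{-x^2/2}\,dx = 1,$$

we obtain

$$\lim_{n\to\infty} e^{-n} \sum_{k=0}^{2n} \frac{n^k}{k!} = 1.$$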
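The two limits for the Poisson partial sums can be observed numerically. The sketch below sums the terms in log space to avoid overflow of $n^k$ and $k!$; the function name is an illustrative assumption.

```python
from math import exp, lgamma, log


def poisson_partial_sum(n, upper):
    """e^{-n} * sum_{k=0}^{upper} n^k / k!, with each term computed in log space."""
    return sum(exp(-n + k * log(n) - lgamma(k + 1)) for k in range(upper + 1))


if __name__ == "__main__":
    for n in (10, 100, 1000):
        # first column tends to 1/2, second column tends to 1
        print(n, poisson_partial_sum(n, n), poisson_partial_sum(n, 2 * n))
```

The first sum approaches 1/2 rather slowly (the error is of order $n^{-1/2}$), while the second one is extremely close to 1 already for moderate $n$.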


Gamma distributed random variables: Finally, we investigate sums of gamma distrib- 
uted random variables. Here the central limit theorem leads to the following result. 


Proposition 7.2.23. Let $X_1, X_2, \ldots$ be independent $\Gamma_{\alpha,\beta}$-distributed random variables.
Then their sums $S_n = X_1 + \cdots + X_n$ satisfy

$$\lim_{n\to\infty} P\left\{ a \le \frac{S_n - n\alpha\beta}{\alpha\sqrt{n\beta}} \le b \right\} = \frac{1}{\sqrt{2\pi}} \int_a^b e^{-x^2/2}\,dx. \tag{7.25}$$

Proof: Propositions 5.1.26 and 5.2.24 tell us that the expected value and the variance
of the $X_j$s are given by $\mu = \mathbb{E}X_1 = \alpha\beta$ and $\sigma^2 = \mathbb{V}X_1 = \alpha^2\beta$. Therefore,
eq. (7.25) follows by an application of Proposition 7.2.6. ∎


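Remark 7.2.24. Note that Proposition 4.6.11 implies that $S_n$ is $\Gamma_{\alpha,n\beta}$-distributed. Thus,
setting

$$I_{n,a,b} := \left\{ x \ge 0 : a \le \frac{x - n\alpha\beta}{\alpha\sqrt{n\beta}} \le b \right\},$$

eq. (7.25) leads to

$$\lim_{n\to\infty} \frac{1}{\alpha^{n\beta}\,\Gamma(n\beta)} \int_{I_{n,a,b}} x^{n\beta - 1} e^{-x/\alpha}\,dx = \frac{1}{\sqrt{2\pi}} \int_a^b e^{-x^2/2}\,dx.$$

Another way to express this is as follows. If $0 \le a < b$, then

$$\frac{1}{\alpha^{n\beta}\,\Gamma(n\beta)} \int_a^b x^{n\beta - 1} e^{-x/\alpha}\,dx \approx \Phi\left( \frac{b - n\alpha\beta}{\alpha\sqrt{n\beta}} \right) - \Phi\left( \frac{a - n\alpha\beta}{\alpha\sqrt{n\beta}} \right).$$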


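Two cases of Proposition 7.2.23, or of Remark 7.2.24, are of special interest.
(a) For $n \ge 1$ let $S_n$ be a $\chi^2_n$-distributed random variable. Then it follows that

$$\lim_{n\to\infty} P\left\{ a \le \frac{S_n - n}{\sqrt{2n}} \le b \right\} = \frac{1}{\sqrt{2\pi}} \int_a^b e^{-x^2/2}\,dx.$$

(b) If $S_n$ is distributed according to the Erlang distribution $E_{\lambda,n}$, then we get

$$\lim_{n\to\infty} P\left\{ a \le \frac{\lambda S_n - n}{\sqrt{n}} \le b \right\} = \frac{1}{\sqrt{2\pi}} \int_a^b e^{-x^2/2}\,dx.$$

For $\lambda = 1$ this implies (set $a = -\infty$ and $b = 0$) that

$$\lim_{n\to\infty} \frac{1}{\Gamma(n)} \int_0^n x^{n-1} e^{-x}\,dx = \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{0} e^{-x^2/2}\,dx = \frac{1}{2}.$$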


Additional Remarks: 
(1) We play a series of the same game. Suppose in each game we may lose or win
a certain amount of money. A natural condition for this game (among friends) is that it
should be fair. But what does it mean that a game is fair? Is this the case if
(i) the average loss or gain in each single game is zero, or is it fair if
(ii) the probability that, after $n$ games, the total loss or gain is positive, tends to 1/2
as $n$ tends to infinity?


The mathematical formulation of the previous question is as follows. Let $X_1, X_2, \ldots$
denote the win or loss in the first game, the second one, and so on. Then the $X_j$s are
independent identically distributed random variables. The above question reads now
as follows. Is the game fair if

(i) the expected value $\mu = \mathbb{E}X_1$ satisfies $\mu = 0$, or is this the case if

(ii) the sum $S_n := X_1 + \cdots + X_n$ fulfills

$$\lim_{n\to\infty} P\{S_n \le 0\} = \lim_{n\to\infty} P\{S_n \ge 0\} = \frac{1}{2}\,? \tag{7.26}$$

In the sequel we have to exclude the trivial case $P\{X_1 = 0\} = 1$, that is, in each game
one neither wins nor loses any money. Of course, then eq. (7.26) does not hold.

At first glance one might believe that the two possible definitions of fairness
describe the same fact. But this is not so, as one may see in an example in [Fel68],
Chapter X, §4. There one finds a sequence of independent random variables $X_1, X_2, \ldots$
with $\mathbb{E}X_1 = 0$, however

$$\lim_{n\to\infty} P\{S_n \le 0\} = 1.$$

In particular, this tells us that, in general, condition (i) does not imply condition (ii).
The next result clarifies the relation between these two definitions of fairness in
the case that the random variables possess a finite second moment.


Proposition 7.2.25. Let $X_1, X_2, \ldots$ be independent and identically distributed with
expected value $\mu$. Assume $P\{X_1 = 0\} < 1$.

1. Then eq. (7.26) always implies $\mu = 0$. That is, a fair game in the sense of (ii) also
satisfies condition (i).

2. Conversely, if $\mathbb{E}|X_1|^2 < \infty$, then (ii) is a consequence of (i). Hence, assuming the
existence of a second moment, conditions (i) and (ii) are equivalent.


Proof: We prove the contrapositive of the first statement. Thus, suppose that (i) does
not hold, that is, we have $\mu \ne 0$. Without loss of generality we may assume $\mu > 0$.
Otherwise, investigate $-X_1, -X_2, \ldots$. An application of Proposition 7.1.27 with
$\varepsilon = \mu/2$ yields

$$\lim_{n\to\infty} P\left\{ \left| \frac{S_n}{n} - \mu \right| < \frac{\mu}{2} \right\} = 1. \tag{7.27}$$

Since $\left| \frac{S_n}{n} - \mu \right| < \mu/2$ implies $\frac{S_n}{n} > \mu/2$, hence $S_n > 0$, it follows that

$$P\left\{ \left| \frac{S_n}{n} - \mu \right| < \frac{\mu}{2} \right\} \le P\{S_n > 0\}.$$

Consequently, from eq. (7.27) we derive

$$\lim_{n\to\infty} P\{S_n > 0\} = 1,$$

hence eq. (7.26) cannot be valid. This proves the first part of the proposition.
We now prove the second assertion. Thus, suppose $\mu = 0$ as well as the existence
of the variance $\sigma^2 = \mathbb{V}X_1$. Note that $\sigma^2 > 0$. Why? If a random variable $X$ satisfies
$\mathbb{E}X = 0$ and $\mathbb{V}X = 0$, then necessarily $P\{X = 0\} = 1$. But, since we assumed
$P\{X_1 = 0\} < 1$, we cannot have $\sigma^2 = \mathbb{V}X_1 = 0$.
Thus, Proposition 7.2.6 applies and leads to

$$\lim_{n\to\infty} P\{S_n \ge 0\} = \lim_{n\to\infty} P\left\{ \frac{S_n}{\sigma\sqrt{n}} \ge 0 \right\} = \frac{1}{\sqrt{2\pi}} \int_0^{\infty} e^{-x^2/2}\,dx = \frac{1}{2}.$$

The proof for $P\{S_n \le 0\} \to 1/2$ follows in the same way, thus eq. (7.26) is valid. This
completes the proof. ∎

(2) How fast does $\frac{S_n - n\mu}{\sigma\sqrt{n}}$ converge to a normally distributed random variable?
Before we answer this question, we have to determine how this speed is measured. In
view of Proposition 7.2.6 we use the following quantity depending on $n \ge 1$:

$$\sup_{t\in\mathbb{R}} \left| P\left\{ \frac{S_n - n\mu}{\sigma\sqrt{n}} \le t \right\} - \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{t} e^{-x^2/2}\,dx \right|.$$

Doing so, the following classical result holds (see [Dur10], Section 3.4.4, for a proof).


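Proposition 7.2.26 (Berry-Esseen theorem). Let $X_1, X_2, \ldots$ be independent identically
distributed random variables with finite third moment, that is, with $\mathbb{E}|X_1|^3 < \infty$.
If $\mu = \mathbb{E}X_1$ and $\sigma^2 = \mathbb{V}X_1 > 0$, then it follows that

$$\sup_{t\in\mathbb{R}} \left| P\left\{ \frac{S_n - n\mu}{\sigma\sqrt{n}} \le t \right\} - \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{t} e^{-x^2/2}\,dx \right| \le C\, \frac{\mathbb{E}|X_1 - \mu|^3}{\sigma^3}\, n^{-1/2}. \tag{7.28}$$

Here $C > 0$ denotes a universal constant.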


Remark 7.2.27. The order $n^{-1/2}$ in estimate (7.28) is optimal and cannot be improved.
This can be seen by the following example. Take independent random variables
$X_1, X_2, \ldots$ with $P\{X_j = -1\} = P\{X_j = 1\} = 1/2$. Hence, in this case $\mu = 0$ and $\sigma^2 = 1$.
Then one has

$$\liminf_{n\to\infty}\; n^{1/2} \sup_{t\in\mathbb{R}} \left| P\left\{ \frac{S_n}{\sqrt{n}} \le t \right\} - \frac{1}{\sqrt{2\pi}} \int_{-\infty}^{t} e^{-x^2/2}\,dx \right| > 0. \tag{7.29}$$

Assertion (7.29) is a consequence of the fact that, if $n$ is even, then the function
$t \mapsto P\left\{ \frac{S_n}{\sqrt{n}} \le t \right\}$ has a jump of order $n^{-1/2}$ at zero. This follows by the calculations
in Example 4.1.7. On the other hand, $t \mapsto \Phi(t)$ is continuous, hence the maximal difference
between these two functions is at least half of the height of the jump.


Remark 7.2.28. The exact value of the constant $C > 0$ appearing in estimate (7.28)
is, in spite of intensive investigations, still unknown. At present, the best-known
estimates are $0.40973 \le C \le 0.478$.
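For the symmetric $\pm 1$ random walk of Remark 7.2.27 the uniform distance to $\Phi$ can be evaluated exactly, because the supremum is attained at an atom of $S_n/\sqrt{n}$, either just before or at the jump. The following sketch (function name illustrative) shows that $\sqrt{n}$ times this distance neither tends to zero nor exceeds the bound quoted for $C$; here $\mathbb{E}|X_1|^3/\sigma^3 = 1$.

```python
from math import comb, sqrt
from statistics import NormalDist


def kolmogorov_distance_walk(n):
    """sup_t |P{S_n/sqrt(n) <= t} - Phi(t)| for the symmetric +-1 random walk."""
    phi = NormalDist().cdf
    total = 2**n
    count = 0  # number of paths with S_n below the current atom
    worst = 0.0
    for j in range(n + 1):          # atom at s = 2*j - n
        t = (2 * j - n) / sqrt(n)
        left = count / total        # value of F_n just before the jump at t
        count += comb(n, j)
        right = count / total       # value of F_n at t
        worst = max(worst, abs(left - phi(t)), abs(right - phi(t)))
    return worst


if __name__ == "__main__":
    for n in (100, 400, 1600):
        print(n, sqrt(n) * kolmogorov_distance_walk(n))
```

For even $n$ the printed values stay close to $1/\sqrt{2\pi} \approx 0.399$, which is half of the normalized jump height at zero.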


7.3 Problems 


Problem 7.1. Let $A_1, A_2, \ldots$ and $B_1, B_2, \ldots$ be two sequences of events in a probability
space $(\Omega, \mathcal{A}, P)$. Prove that

$$\limsup_{n\to\infty}\,(A_n \cup B_n) = \limsup_{n\to\infty}\,(A_n) \cup \limsup_{n\to\infty}\,(B_n).$$

Is this also valid for the intersection? That is, does one have

$$\limsup_{n\to\infty}\,(A_n \cap B_n) = \limsup_{n\to\infty}\,(A_n) \cap \limsup_{n\to\infty}\,(B_n)\,?$$


Problem 7.2. Let $(X_n)_{n\ge 1}$ be a sequence of independent $E_\lambda$-distributed random
variables. Characterize the sequences $(c_n)_{n\ge 1}$ of positive real numbers for which

$$P\{X_n \ge c_n \text{ infinitely often}\} = 1.$$


Problem 7.3. Let $f : [0,1] \to \mathbb{R}$ be a continuous function. Its Bernstein polynomial
$B_n^f$ of degree $n$ is defined by

$$B_n^f(x) := \sum_{k=0}^{n} f\left( \frac{k}{n} \right) \binom{n}{k} x^k (1-x)^{n-k}, \quad 0 \le x \le 1.$$

Show that Proposition 7.1.29 implies the following. If $P$ is the uniform distribution on
$[0, 1]$, then

$$P\left\{ x \in [0,1] : \lim_{n\to\infty} B_n^f(x) = f(x) \right\} = 1.$$

Remark: Using methods from Calculus, one may even show the uniform convergence,
that is,

$$\lim_{n\to\infty} \sup_{0\le x\le 1} |B_n^f(x) - f(x)| = 0.$$


Problem 7.4. Roll a fair die 180 times. What is the probability that the number “6”
occurs less than or equal to 25 times? Determine this probability by the following three
methods:


— Directly via the binomial distribution.
— Approximately, by virtue of the central limit theorem.
— Approximately, by applying the continuity correction.


Problem 7.5. Toss a fair coin 16 times. Compute the probability to observe exactly
eight times “head” by the following methods.


— Directly via the binomial distribution.
— Approximately, by applying the continuity correction.


Why does one not get a reasonable result using the normal approximation directly,
that is, without continuity correction?

Problem 7.6. Let $X_1, X_2, \ldots$ be a sequence of independent $G_p$-distributed random
variables, that is, for some $0 < p < 1$ one has

$$P\{X_1 = k\} = p(1-p)^{k-1}, \quad k = 1, 2, \ldots$$

1. What does the central limit theorem tell us in this case about the behavior of the
sums $S_n = X_1 + \cdots + X_n$?
2. For two real numbers $a < b$ set

$$I_{n,a,b} := \left\{ k \ge 0 : a \le \frac{pk - n(1-p)}{\sqrt{n(1-p)}} \le b \right\}.$$

Show that

$$\lim_{n\to\infty} \sum_{k \in I_{n,a,b}} \binom{n+k-1}{k} p^n (1-p)^k = \frac{1}{\sqrt{2\pi}} \int_a^b e^{-x^2/2}\,dx.$$

Hint: Use Corollary 4.6.9 and investigate $S_n - n$.


8 Mathematical Statistics 


8.1 Statistical Models 
8.1.1 Nonparametric Statistical Models 


The main objective of Probability Theory is to describe and analyze random experi- 
ments by means of a suitable probability space (Q, A, P). Here it is always assumed 
that the probability space is known, in particular, that the describing probability 
measure, P, is identified. 


Probability Theory: 
Description of a random experiment and its properties by a probability space. The 
distribution of the outcomes is assumed to be known. 


Mathematical Statistics deals mainly with the reverse question: one executes an ex- 
periment, that is, one draws a sample (e.g., one takes a series of measurements of an 
item or one interrogates several people), and, on the basis of the observed sample, 
one wants to derive as much information as possible about the (unknown) underlying 
probability measure P. Sometimes the precise knowledge of P is not needed; it may 
suffice to know a certain parameter of P. 


Mathematical Statistics: 
As a result of a statistical experiment, a (random) sample is observed. On its basis, 
conclusions are drawn about the unknown underlying probability distribution. 


Let us state the mathematical formulation of the task: first we mention that it is
standard practice in Mathematical Statistics to denote the describing probability space by
$(\mathcal{X}, \mathcal{F}, P)$. As before, $\mathcal{X}$ is the sample space (the set that contains all possible
outcomes of the experiment), and $\mathcal{F}$ is a suitable $\sigma$-field of events. The probability
measure $P$ describes the experiment, that is, $P(A)$ is the probability of observing a sample
belonging to $A$, but recall that $P$ is unknown.

Based on theoretical considerations or on long-time experience, quite often we
are able to restrict the entirety of probability measures in question. Mathematically,
that means that we choose a set $\mathcal{P}$ of probability measures on $(\mathcal{X}, \mathcal{F})$, which contains
what we believe to be the “correct” $P$. It is not excluded that $\mathcal{P}$ is the set of
all probability measures, but for most statistical methods it is very advantageous to
take $\mathcal{P}$ as small as possible. On the other hand, the set $\mathcal{P}$ cannot be chosen too small,
because we have to be sure that the “correct” $P$ is really contained in $\mathcal{P}$. Otherwise, the
obtained results are either false or imprecise.

Definition 8.1.1. A subset $\mathcal{P}$ of probability measures on $(\mathcal{X}, \mathcal{F})$ is called a
distribution assumption, that is, one assumes that the underlying $P$ belongs
to $\mathcal{P}$.

After having fixed the distribution assumption $\mathcal{P}$, one now regards only probability
measures $P \in \mathcal{P}$ or, equivalently, measures not in $\mathcal{P}$ are sorted out.

To get information about the unknown probability measure, one performs a statistical
experiment or analyzes some given data. In both cases, the result is a random
sample $x \in \mathcal{X}$. The task of Mathematical Statistics is to get information about $P \in \mathcal{P}$,
based on the observed sample $x \in \mathcal{X}$. A suitable way to describe the problem is as
follows.


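Definition 8.1.2. A (nonparametric) statistical model is a collection of probability
spaces $(\mathcal{X}, \mathcal{F}, P)$ with $P \in \mathcal{P}$. Here, $\mathcal{X}$ and $\mathcal{F}$ are fixed, and $P$ varies through the
distribution assumption $\mathcal{P}$. One writes for the model

$$(\mathcal{X}, \mathcal{F}, P)_{P\in\mathcal{P}} \quad \text{or} \quad \{(\mathcal{X}, \mathcal{F}, P) : P \in \mathcal{P}\}.$$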


Let us illustrate the previous definition with two examples. 


Example 8.1.3. In an urn are white and black balls of an unknown ratio. Let $\theta \in [0, 1]$
be the (unknown) proportion of white balls, hence $1 - \theta$ is that of the black ones. In
order to get some information about $\theta$, one randomly chooses $n$ balls with replacement.
The result of this experiment, or the sample, is a number $k \in \{0, \ldots, n\}$, the frequency
of observed white balls. Thus, the sample space is $\mathcal{X} = \{0, \ldots, n\}$ and as $\sigma$-field we may
choose, as always for finite sample spaces, the powerset $\mathcal{P}(\mathcal{X})$. The possible probability
measures describing this experiment are binomial distributions $B_{n,\theta}$ with $0 \le \theta \le 1$.
Consequently, the distribution assumption is

$$\mathcal{P} = \{B_{n,\theta} : \theta \in [0, 1]\}.$$

Summing up, the statistical model describing the experiment is

$$(\mathcal{X}, \mathcal{P}(\mathcal{X}), P)_{P\in\mathcal{P}} \quad \text{where} \quad \mathcal{X} = \{0, \ldots, n\} \quad \text{and} \quad \mathcal{P} = \{B_{n,\theta} : 0 \le \theta \le 1\}.$$

Next, we consider an important example from quality control.


Example 8.1.4. A buyer obtains from a trader a delivery of $N$ machines. Among them
$M \le N$ are defective. The buyer does not know the value of $M$. To determine it, he
randomly chooses $n$ machines from the delivery and checks them. The result, or the
sample, is the number $0 \le m \le n$ of defective machines among the $n$ tested.

Thus, the sample space is $\mathcal{X} = \{0, \ldots, n\}$, $\mathcal{F} = \mathcal{P}(\mathcal{X})$, and the probability measures
in question are hypergeometric ones. Therefore, the distribution assumption is

$$\mathcal{P} = \{H_{N,M,n} : M = 0, \ldots, N\},$$

where $H_{N,M,n}$ denotes the hypergeometric distribution with parameters $N$, $M$, and $n$,
as introduced in Definition 1.4.25.


Before we proceed further, we consider a particularly interesting case of statistical 
model, which describes the n-fold independent repetition of a single experiment. 
To explain this model, let us investigate the following easy example. 


Example 8.1.5. We are given a die that looks biased. To check this, we roll it $n$
times and record the sequence of numbers appearing in each of the trials. Thus, our
sample space is $\mathcal{X} = \{1, \ldots, 6\}^n$, and the observed sample is $x = (x_1, \ldots, x_n)$, with
$1 \le x_k \le 6$. Let $\theta_1, \ldots, \theta_6$ be the probabilities for 1 to 6. Then we want to check whether
$\theta_1 = \cdots = \theta_6 = 1/6$, that is, whether $P_0$ given by $P_0(\{k\}) = \theta_k$, $1 \le k \le 6$, is the
uniform distribution. What are the possible probability measures on $(\mathcal{X}, \mathcal{P}(\mathcal{X}))$
describing the statistical experiment? Since the results of different rolls are independent,
the describing measure $P$ is of the form $P = P_0^{\otimes n}$ with

$$P_0^{\otimes n}(\{x\}) = P_0(\{x_1\}) \cdots P_0(\{x_n\}) = \theta_1^{m_1} \cdots \theta_6^{m_6}, \quad x = (x_1, \ldots, x_n),$$

where $m_k$ denotes the frequency of the number $1 \le k \le 6$ in the sequence $x$.
Consequently, the natural distribution assumption is

$$\mathcal{P} = \{P_0^{\otimes n} : P_0 \text{ probability measure on } \{1, \ldots, 6\}\}.$$


Suppose we are given a probability space $(\mathcal{X}_0, \mathcal{F}_0, P_0)$ with unknown $P_0 \in \mathcal{P}_0$. Here,
$\mathcal{P}_0$ denotes a set of probability measures on $(\mathcal{X}_0, \mathcal{F}_0)$, hopefully containing the
“correct” $P_0$. We call $(\mathcal{X}_0, \mathcal{F}_0, P_0)_{P_0\in\mathcal{P}_0}$ the initial model. In Example 8.1.5, the initial
model is $\mathcal{X}_0 = \{1, \ldots, 6\}$, while $\mathcal{P}_0$ is the set of all probability measures on
$(\mathcal{X}_0, \mathcal{P}(\mathcal{X}_0))$.

In order to determine $P_0$, we execute $n$ independent trials according to $P_0$. The
result, or the observed sample, is a vector $x = (x_1, \ldots, x_n)$ with $x_j \in \mathcal{X}_0$. Consequently,
the natural sample space is $\mathcal{X} = \mathcal{X}_0^n$.

Which statistical model does this experiment describe? To answer this question, let us recall the basic results in Section 1.9, where exactly those problems have been investigated. As σ-field $\mathcal{F}$ we choose the n-fold product σ-field of $\mathcal{F}_0$, that is,

$$\mathcal{F} = \underbrace{\mathcal{F}_0 \otimes \cdots \otimes \mathcal{F}_0}_{n \text{ times}},$$

and the describing probability measure $\mathbb{P}$ is of the form $\mathbb{P}_0^{\otimes n}$, that is, it is the n-fold product of $\mathbb{P}_0$. Recall that, according to Definition 1.9.5, the product $\mathbb{P}_0^{\otimes n}$ is the unique probability measure on $(\mathcal{X}, \mathcal{F})$ satisfying

$$\mathbb{P}_0^{\otimes n}(A_1 \times \cdots \times A_n) = \mathbb{P}_0(A_1) \cdots \mathbb{P}_0(A_n),$$

whenever $A_j \in \mathcal{F}_0$.

Since we assumed $\mathbb{P}_0 \in \mathcal{P}_0$, the possible probability measures are $\mathbb{P}_0^{\otimes n}$ with $\mathbb{P}_0 \in \mathcal{P}_0$.

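Let us summarize what we have obtained so far.

Definition 8.1.6. The statistical model for the n-fold independent repetition of an experiment, determined by the initial model $(\mathcal{X}_0, \mathcal{F}_0, \mathbb{P}_0)_{\mathbb{P}_0 \in \mathcal{P}_0}$, is given by

$$(\mathcal{X}, \mathcal{F}, \mathbb{P}_0^{\otimes n})_{\mathbb{P}_0 \in \mathcal{P}_0},$$

where $\mathcal{X} = \mathcal{X}_0^n$, $\mathcal{F}$ denotes the n-fold product σ-field of $\mathcal{F}_0$, and $\mathbb{P}_0^{\otimes n}$ is the n-fold product measure of $\mathbb{P}_0$.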


Remark 8.1.7. Of course, the main goal in the model of n-fold repetition is to get some knowledge about $\mathbb{P}_0$. To obtain the desired information, we perform n independent trials, each time observing a value distributed according to $\mathbb{P}_0$. Altogether, the sample is a vector $x = (x_1,\ldots,x_n)$, which is now distributed according to $\mathbb{P}_0^{\otimes n}$.


The two following examples explain Definition 8.1.6. 


Example 8.1.8. A coin is labeled on one side with “0” and on the other side with “1.” There is some evidence that the coin is biased. To check this, let us execute the following statistical experiment: toss the coin n times and record the sequence of zeroes and ones. Thus, the observed sample is some $x = (x_1,\ldots,x_n)$, with each $x_k$ being either “0” or “1.”

Our initial model is given by $\mathcal{X}_0 = \{0,1\}$ and $\mathbb{P}_0 = B_{1,\theta}$ for a certain (unknown) $\theta \in [0,1]$. Then the experiment is described by $\mathcal{X} = \{0,1\}^n$ and $\mathcal{P} = \{B_{1,\theta}^{\otimes n} : 0 \le \theta \le 1\}$. Note that

$$B_{1,\theta}^{\otimes n}(\{x\}) = \theta^k (1-\theta)^{n-k}, \qquad k := x_1 + \cdots + x_n.$$
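As a quick plausibility check of the formula in Example 8.1.8, the sketch below evaluates it for every 0-1 sequence of a fixed length and confirms that the values add up to one. The function name and the values of `theta` and `n` are assumptions made for this illustration.

```python
from itertools import product

def bernoulli_product_probability(x, theta):
    """theta^k * (1 - theta)^(n - k), where k is the number of ones in x."""
    k = sum(x)
    n = len(x)
    return theta ** k * (1 - theta) ** (n - k)

theta = 0.3  # hypothetical success probability
n = 4        # hypothetical number of tosses

# Sum the probabilities of all 2^n possible samples.
total = sum(bernoulli_product_probability(x, theta)
            for x in product((0, 1), repeat=n))
print(total)
```

The probability of a sample depends only on the number k of ones, which is why the total number of ones carries all the information about θ contained in the sample.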


Example 8.1.9. A company produces a new type of light bulb with an unknown distribution of the lifetime. To determine it, n light bulbs are switched on at the same time. Let $t = (t_1,\ldots,t_n)$ be the times when the bulbs burn out. Then our sample is the vector $t \in (0,\infty)^n$.

By long-time experience one knows that the lifetime of each light bulb is exponentially distributed. Thus, the initial model is $(\mathbb{R}, \mathcal{B}(\mathbb{R}), \mathcal{P}_0)$ with $\mathcal{P}_0 = \{E_\lambda : \lambda > 0\}$. Consequently, the experiment of testing n light bulbs is described by the model

$$(\mathbb{R}^n, \mathcal{B}(\mathbb{R}^n), \mathbb{P}^{\otimes n})_{\mathbb{P} \in \mathcal{P}_0} = (\mathbb{R}^n, \mathcal{B}(\mathbb{R}^n), E_\lambda^{\otimes n})_{\lambda > 0},$$

where $\mathcal{P}_0 = \{E_\lambda : \lambda > 0\}$. Recall that $E_\lambda^{\otimes n}$ is the probability measure on $(\mathbb{R}^n, \mathcal{B}(\mathbb{R}^n))$ with density $p(t_1,\ldots,t_n) = \lambda^n e^{-\lambda(t_1 + \cdots + t_n)}$ for $t_j \ge 0$.
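The density of the n-fold product in Example 8.1.9 is the product of the one-dimensional exponential densities. A minimal sketch of this identity follows; the function names, the rate `lam`, and the lifetimes `ts` are hypothetical.

```python
from math import exp, prod

def exponential_density(t, lam):
    """Density of the exponential distribution E_lambda at the point t."""
    return lam * exp(-lam * t) if t >= 0 else 0.0

def joint_density(ts, lam):
    """Density lambda^n * exp(-lambda * (t_1 + ... + t_n)) of the n-fold product."""
    if any(t < 0 for t in ts):
        return 0.0
    return lam ** len(ts) * exp(-lam * sum(ts))

lam = 0.5              # hypothetical rate
ts = [1.2, 0.4, 3.1]   # hypothetical lifetimes of n = 3 bulbs

closed_form = joint_density(ts, lam)
as_product = prod(exponential_density(t, lam) for t in ts)
print(closed_form, as_product)
```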


8.1.2 Parametric Statistical Models 


In all of our previous examples there was a parameter that parametrized the probability measures in $\mathcal{P}$ in a natural way. In Example 8.1.3, this is the parameter $\theta \in [0,1]$, in Example 8.1.4, the probability measures are parametrized by $M \in \{0,\ldots,N\}$, in Example 8.1.8 the parameter is also $\theta \in [0,1]$, and, finally, in Example 8.1.9 the natural parameter is $\lambda > 0$. Therefore, from now on, we assume that there is a parameter set $\Theta$ such that $\mathcal{P}$ may be represented as

$$\mathcal{P} = \{\mathbb{P}_\theta : \theta \in \Theta\}.$$


Definition 8.1.10. A parametric statistical model is defined as

$$(\mathcal{X}, \mathcal{F}, \mathbb{P}_\theta)_{\theta \in \Theta}$$

with parameter set $\Theta$. Equivalently, we suppose that the distribution assumption $\mathcal{P}$, appearing in Definition 8.1.2, may be represented as $\mathcal{P} = \{\mathbb{P}_\theta : \theta \in \Theta\}$.

In this notation, the parameter sets in Examples 8.1.3, 8.1.4, 8.1.8, and 8.1.9 are $\Theta = [0,1]$, $\Theta = \{0,\ldots,N\}$, $\Theta = [0,1]$, and $\Theta = (0,\infty)$, respectively.


Remark 8.1.11. It is worth mentioning that the parameter can be quite general; for example, it can be a vector $\theta = (\theta_1,\ldots,\theta_k)$, so that in fact there are k unknown parameters $\theta_j$, combined into a single vector $\theta$. For instance, in Example 8.1.5, the unknown parameters are $\theta_1,\ldots,\theta_6$; thus, the parameter set is given by

$$\Theta = \{\theta = (\theta_1,\ldots,\theta_6) : \theta_k \ge 0,\ \theta_1 + \cdots + \theta_6 = 1\}.$$


Let us present two further examples with slightly more complicated parameter sets. 
Example 8.1.12. We are given an item of unknown length. It is measured by an instrument of unknown precision. We assume that the instrument is unbiased, that is, on average, it shows the correct value. In view of the central limit theorem we may suppose that the measurements are distributed according to a normal distribution $\mathcal{N}(\mu, \sigma^2)$. Here $\mu$ is the “correct” length of the item, and $\sigma > 0$ reflects the precision of the measuring instrument. A small $\sigma > 0$ says that the instrument is quite precise, while large values of $\sigma$ correspond to inaccurate instruments. Consequently, by the distribution assumption the initial model is given as

$$(\mathbb{R}, \mathcal{B}(\mathbb{R}), \mathcal{N}(\mu, \sigma^2))_{\mu \in \mathbb{R},\, \sigma^2 > 0}.$$

In order to determine $\mu$ (and maybe also $\sigma$) we measure the item n times by the same method. As a result we obtain a random sample $x = (x_1,\ldots,x_n) \in \mathbb{R}^n$. Thus, our model describing this experiment is

$$(\mathbb{R}^n, \mathcal{B}(\mathbb{R}^n), \mathcal{N}(\mu, \sigma^2)^{\otimes n})_{(\mu,\sigma^2) \in \mathbb{R} \times (0,\infty)}.$$

Because of eq. (6.9), the model may also be written as

$$(\mathbb{R}^n, \mathcal{B}(\mathbb{R}^n), \mathcal{N}(\vec{\mu}, \sigma^2 I_n))_{(\mu,\sigma^2) \in \mathbb{R} \times (0,\infty)}$$

with $\vec{\mu} = (\mu,\ldots,\mu) \in \mathbb{R}^n$, and with diagonal matrix $\sigma^2 I_n$. The unknown parameter is $(\mu, \sigma^2)$, taken from the parameter set $\mathbb{R} \times (0,\infty)$.
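The two ways of writing the model in Example 8.1.12 correspond to two ways of writing the same density on $\mathbb{R}^n$. The sketch below compares the product of n one-dimensional normal densities with the standard closed form of the n-dimensional normal density with mean vector $(\mu,\ldots,\mu)$ and covariance matrix $\sigma^2 I_n$. The function names and the numerical values are hypothetical.

```python
from math import exp, pi, prod, sqrt

def normal_density(x, mu, sigma2):
    """Density of N(mu, sigma^2) at the point x."""
    return exp(-(x - mu) ** 2 / (2 * sigma2)) / sqrt(2 * pi * sigma2)

def product_density(xs, mu, sigma2):
    """Density of the n-fold product N(mu, sigma^2)^{(x) n} at xs."""
    return prod(normal_density(x, mu, sigma2) for x in xs)

def isotropic_normal_density(xs, mu, sigma2):
    """Density of the n-dimensional normal law with mean (mu, ..., mu)
    and covariance matrix sigma^2 * I_n, evaluated at xs."""
    n = len(xs)
    squared_distance = sum((x - mu) ** 2 for x in xs)
    return (2 * pi * sigma2) ** (-n / 2) * exp(-squared_distance / (2 * sigma2))

# Hypothetical measurements of an item with hypothetical parameters.
xs = [10.1, 9.8, 10.3, 10.0]
mu, sigma2 = 10.0, 0.04

d_product = product_density(xs, mu, sigma2)
d_vector = isotropic_normal_density(xs, mu, sigma2)
print(d_product, d_vector)
```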


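Example 8.1.13. Suppose now we have two different items of lengths $\mu_1$ and $\mu_2$. We take m measurements of the first item and n of the second one. Thereby, we use different instruments with possibly different degrees of precision. All measurements are taken independently of each other. As a result we get a vector $(x, y) \in \mathbb{R}^{m+n}$, where $x = (x_1,\ldots,x_m)$ are the values of the first m measurements and $y = (y_1,\ldots,y_n)$ those of the second n ones. As before we assume that the $x_i$ are distributed according to $\mathcal{N}(\mu_1, \sigma_1^2)$, and the $y_j$ according to $\mathcal{N}(\mu_2, \sigma_2^2)$. We know neither $\mu_1$ and $\mu_2$ nor $\sigma_1^2$ and $\sigma_2^2$. Thus, the sample space is $\mathbb{R}^{m+n}$ and the vectors $(x, y)$ are distributed according to $\mathcal{N}((\vec{\mu}_1, \vec{\mu}_2), R_{\sigma_1^2,\sigma_2^2})$, where, as in Example 8.1.12, $\vec{\mu}_1 = (\mu_1,\ldots,\mu_1) \in \mathbb{R}^m$ and $\vec{\mu}_2 = (\mu_2,\ldots,\mu_2) \in \mathbb{R}^n$, and where the diagonal matrix $R_{\sigma_1^2,\sigma_2^2}$ has $\sigma_1^2$ on its first m diagonal entries and $\sigma_2^2$ on the remaining n ones.

Note that by Definition 1.9.5,

$$\mathcal{N}((\vec{\mu}_1, \vec{\mu}_2), R_{\sigma_1^2,\sigma_2^2}) = \mathcal{N}(\mu_1, \sigma_1^2)^{\otimes m} \otimes \mathcal{N}(\mu_2, \sigma_2^2)^{\otimes n}.$$

This is valid because, if $A \in \mathcal{B}(\mathbb{R}^m)$ and $B \in \mathcal{B}(\mathbb{R}^n)$, then it follows that

$$\mathcal{N}((\vec{\mu}_1, \vec{\mu}_2), R_{\sigma_1^2,\sigma_2^2})(A \times B) = \mathcal{N}(\mu_1, \sigma_1^2)^{\otimes m}(A) \cdot \mathcal{N}(\mu_2, \sigma_2^2)^{\otimes n}(B).$$

The parameter set in this example is given as $\mathbb{R}^2 \times (0,\infty)^2$, hence the statistical model may be written as

$$(\mathbb{R}^{m+n}, \mathcal{B}(\mathbb{R}^{m+n}), \mathcal{N}(\mu_1, \sigma_1^2)^{\otimes m} \otimes \mathcal{N}(\mu_2, \sigma_2^2)^{\otimes n})_{(\mu_1,\mu_2,\sigma_1^2,\sigma_2^2) \in \mathbb{R}^2 \times (0,\infty)^2}.$$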


8.2 Statistical Hypothesis Testing 
8.2.1 Hypotheses and Tests 


We start with a parametric statistical model $(\mathcal{X}, \mathcal{F}, \mathbb{P}_\theta)_{\theta \in \Theta}$. Suppose the parameter set $\Theta$ is split up into disjoint subsets $\Theta_0$ and $\Theta_1$. The aim of a test is to decide, on the basis of the observed sample, whether the “true” parameter $\theta$ belongs to $\Theta_0$ or to $\Theta_1$. Let us explain the problem with two examples.


Example 8.2.1. Consider once more the situation described in Example 8.1.4. Assume there exists a critical value $M_0 \le N$ such that the buyer accepts the delivery if the number M of defective machines satisfies $M \le M_0$. Otherwise, if $M > M_0$, the buyer rejects it and sends the machines back to the trader. In this example the parameter set is $\Theta = \{0,\ldots,N\}$. Letting $\Theta_0 = \{0,\ldots,M_0\}$ and $\Theta_1 = \{M_0+1,\ldots,N\}$, the question about acceptance or rejection of the delivery is equivalent to whether $M \in \Theta_0$ or $M \in \Theta_1$. Assume now the buyer checked n of the N machines and found m defective machines. On the basis of this observation, the buyer has to decide about acceptance or rejection, or, equivalently, about $M \in \Theta_0$ or $M \in \Theta_1$.


Example 8.2.2. Let us consider once more Example 8.1.13. There we had two measuring instruments, both being unbiased. Consequently, the expected values $\mu_1$ and $\mu_2$ are the correct lengths of the two items. The parameter set was $\Theta = \mathbb{R}^2 \times (0,\infty)^2$. Suppose we conjecture that both items are of equal length, that is, we conjecture $\mu_1 = \mu_2$. Letting

$$\Theta_0 := \{(\mu, \mu, \sigma_1^2, \sigma_2^2) : \mu \in \mathbb{R},\ \sigma_1^2, \sigma_2^2 > 0\}$$

and $\Theta_1 = \Theta \setminus \Theta_0$, to prove or disprove the conjecture, we have to check whether $(\mu_1, \mu_2, \sigma_1^2, \sigma_2^2)$ belongs to $\Theta_0$ or to $\Theta_1$.

On the other hand, if we want to know whether or not the first item is smaller than the second one, then we have to choose

$$\Theta_0 := \{(\mu_1, \mu_2, \sigma_1^2, \sigma_2^2) : -\infty < \mu_1 \le \mu_2 < \infty,\ \sigma_1^2, \sigma_2^2 > 0\}$$

and to check whether or not $(\mu_1, \mu_2, \sigma_1^2, \sigma_2^2)$ belongs to $\Theta_0$.


An exact mathematical formulation of the previous problems is as follows. 
Definition 8.2.3. Let $(\mathcal{X}, \mathcal{F}, \mathbb{P}_\theta)_{\theta \in \Theta}$ be a parametric statistical model and suppose $\Theta = \Theta_0 \cup \Theta_1$ with $\Theta_0 \cap \Theta_1 = \emptyset$.

Then the hypothesis or, more precisely, null hypothesis $H_0$ says that for the “correct” $\theta \in \Theta$ one has $\theta \in \Theta_0$. This is expressed by writing $H_0 : \theta \in \Theta_0$.

The alternative hypothesis $H_1$ says $\theta \in \Theta_1$. This is formulated as $H_1 : \theta \in \Theta_1$.


After the hypothesis is set, one executes a statistical experiment. Here the order is 
important: first one has to set the hypothesis, then test it, not vice versa. If the hypo- 
thesis is chosen on the basis of the observed results, then, of course, the sample will 
confirm it. 

Say the result of the experiment is some sample $x \in \mathcal{X}$. One of the fundamental problems in Mathematical Statistics is to decide, on the basis of the observed sample, about acceptance or rejection of $H_0$. The mathematical formulation of the problem is as follows.


Definition 8.2.4. A (hypothesis) test T for checking $H_0$ (against $H_1$) is a disjoint partition $T = (\mathcal{X}_0, \mathcal{X}_1)$ of the sample space $\mathcal{X}$. The set $\mathcal{X}_0$ is called the region of acceptance while $\mathcal{X}_1$ is said to be the critical region¹ or region of rejection. For mathematical reasons we have to assume $\mathcal{X}_0 \in \mathcal{F}$, which of course implies $\mathcal{X}_1 \in \mathcal{F}$ as well.


Remark 8.2.5. A hypothesis test $T = (\mathcal{X}_0, \mathcal{X}_1)$ operates as follows: if the statistical experiment leads to a sample $x \in \mathcal{X}_1$, then we reject $H_0$. But, if we get an $x \in \mathcal{X}_0$, then this does not contradict the hypothesis, and for now we may continue to work with it.

Important comment: If we observe an $x \in \mathcal{X}_0$, then this does not say that $H_0$ is correct. It only asserts that we failed to reject it or that there is a lack of evidence against it.


Let us illustrate the procedure with Example 8.2.1. 

Example 8.2.6. By the choice of $\Theta_0$ and $\Theta_1$ the hypothesis $H_0$ is given by

$$H_0 : 0 \le M \le M_0, \qquad \text{hence} \qquad H_1 : M_0 < M \le N.$$

To test $H_0$ against $H_1$, the sample space $\mathcal{X} = \{0,\ldots,n\}$ is split up into the two regions $\mathcal{X}_0 := \{0,\ldots,m_0\}$ and $\mathcal{X}_1 := \{m_0+1,\ldots,n\}$ with some (for now) arbitrary number $m_0 \in \{0,\ldots,n\}$. If among the checked n machines m are defective with some $m > m_0$, then $m \in \mathcal{X}_1$, hence one rejects $H_0$. In this case the buyer refuses to take the delivery and sends it back to the trader. On the other hand, if $m \le m_0$, then $m \in \mathcal{X}_0$, which does not contradict $H_0$, and the buyer will accept the delivery and pay for it. Of course, the key question is how to choose the value $m_0$ in a proper way.

¹ Sometimes also called “critical section.”


Remark 8.2.7. Sometimes tests are also defined as mappings $\varphi : \mathcal{X} \to \{0,1\}$. The link between these two approaches is immediately clear. Starting with $\varphi$, the hypothesis test $T = (\mathcal{X}_0, \mathcal{X}_1)$ is constructed by $\mathcal{X}_0 = \{x \in \mathcal{X} : \varphi(x) = 0\}$ and $\mathcal{X}_1 = \{x \in \mathcal{X} : \varphi(x) = 1\}$. Conversely, if $T = (\mathcal{X}_0, \mathcal{X}_1)$ is a given test, then set $\varphi(x) = 0$ if $x \in \mathcal{X}_0$ and $\varphi(x) = 1$ for $x \in \mathcal{X}_1$. The advantage of this approach is that it allows us to define so-called randomized tests. Here $\varphi : \mathcal{X} \to [0,1]$. Then, as before, $\mathcal{X}_0 = \{x \in \mathcal{X} : \varphi(x) = 0\}$ and $\mathcal{X}_1 = \{x \in \mathcal{X} : \varphi(x) = 1\}$. If $0 < \varphi(x) < 1$, then

$$\varphi(x) = \mathbb{P}\{\text{reject } H_0 \text{ if } x \text{ is observed}\}.$$

That is, for certain observations $x \in \mathcal{X}$, an additional random experiment (e.g., tossing a coin) decides whether we accept or reject $H_0$. Randomized tests are useful in the case of finite or countably infinite sample spaces.
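The idea of a randomized test from Remark 8.2.7 can be sketched in a few lines. In the sketch below the rule `phi`, the threshold `m0`, and the fair coin used at the boundary are hypothetical choices for illustration only; they are not a test constructed in the text.

```python
import random

def phi(m, m0):
    """Hypothetical randomized test on {0, ..., n}: accept below m0,
    reject above m0, and toss a fair coin if exactly m0 is observed."""
    if m < m0:
        return 0.0
    if m > m0:
        return 1.0
    return 0.5

def reject(m, m0, rng):
    """Return True if H0 is rejected; phi(m, m0) is the rejection probability."""
    return rng.random() < phi(m, m0)

rng = random.Random(12345)  # fixed seed so that the run is reproducible
m0 = 5

# Empirical rejection frequency at the boundary observation m = m0.
trials = 10_000
frequency = sum(reject(m0, m0, rng) for _ in range(trials)) / trials
print(frequency)
```

Observations with $\varphi(x) = 0$ or $\varphi(x) = 1$ are handled exactly as in a nonrandomized test; the additional random experiment matters only where $0 < \varphi(x) < 1$.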
When applying a test $T = (\mathcal{X}_0, \mathcal{X}_1)$ to check the null hypothesis $H_0 : \theta \in \Theta_0$, two different types of errors may occur.


Definition 8.2.8. An error of the first kind or type I error occurs if $H_0$ is true but one observes a sample $x \in \mathcal{X}_1$, hence rejects $H_0$.

Type I error = incorrect rejection of a true null hypothesis

In other words, a type I error happens if the “true” $\theta$ is in $\Theta_0$, but we observe an $x \in \mathcal{X}_1$.

Definition 8.2.9. An error of the second kind or type II error occurs if $H_0$ is false, but the observed sample lies in $\mathcal{X}_0$, hence we do not reject the false hypothesis $H_0$.

Type II error = failure to reject a false null hypothesis

Consequently, a type II error occurs if the “true” $\theta$ is in $\Theta_1$, but the observed sample is an element of the region of acceptance $\mathcal{X}_0$.
Example 8.2.10. In the context of Example 8.2.6 a type I error occurs if the delivery was in good order, but among the checked machines there were more than $m_0$ defective ones, so that the buyer rejects the delivery. Since the trader was not able to sell a proper delivery, this error is also called the risk of the trader.

On the other hand, a type II error occurs if the delivery is not in good order, but among the checked machines there were only a few defective ones (less than or equal to $m_0$). Thus, the buyer accepts the bad delivery and pays for it. Therefore, this type of error is also called the risk of the buyer.


8.2.2 Power Function and Significance Tests 


The power of a test is described by its power function defined as follows. 


Definition 8.2.11. Let $T = (\mathcal{X}_0, \mathcal{X}_1)$ be a test for $H_0 : \theta \in \Theta_0$ against $H_1 : \theta \in \Theta_1$. The function $\beta_T$ from $\Theta$ to $[0,1]$ defined as

$$\beta_T(\theta) := \mathbb{P}_\theta(\mathcal{X}_1)$$

is called the power function of the test T.

Remark 8.2.12. If $\theta \in \Theta_0$, that is, if $H_0$ is true, then $\beta_T(\theta) = \mathbb{P}_\theta(\mathcal{X}_1)$ is the probability that $\mathcal{X}_1$ occurs or, equivalently, that a type I error happens.

On the contrary, if $\theta \in \Theta_1$, that is, $H_0$ is false, then $1 - \beta_T(\theta) = \mathbb{P}_\theta(\mathcal{X}_0)$ is the probability that $\mathcal{X}_0$ occurs or, equivalently, that a type II error appears.

Thus, a “good” test should satisfy the following conditions: the power function $\beta_T$ attains small values on $\Theta_0$ and/or $1 - \beta_T$ has small values on $\Theta_1$. Then the probabilities for the occurrence of type I and/or type II errors are not too big.²


Example 8.2.13. What is the power function of the test presented in Example 8.2.6? Recall that $\Theta = \{0,\ldots,N\}$ and $\mathcal{X}_1 = \{m_0+1,\ldots,n\}$. Hence, $\beta_T$ maps $\{0,\ldots,N\}$ to $[0,1]$ in the following way:

$$\beta_T(M) = H_{N,M,n}(\mathcal{X}_1) = \sum_{m=m_0+1}^{n} \frac{\binom{M}{m}\binom{N-M}{n-m}}{\binom{N}{n}}. \qquad (8.1)$$

Thus, the maximal probability for a type I error is given by

$$\max_{0 \le M \le M_0} \beta_T(M) = \max_{0 \le M \le M_0} \sum_{m=m_0+1}^{n} \frac{\binom{M}{m}\binom{N-M}{n-m}}{\binom{N}{n}},$$

while the maximal probability for a type II error equals

$$\max_{M_0 < M \le N} (1 - \beta_T(M)) = \max_{M_0 < M \le N} \sum_{m=0}^{m_0} \frac{\binom{M}{m}\binom{N-M}{n-m}}{\binom{N}{n}}.$$

² In the literature the power function is sometimes defined in a slightly different way. If $\theta \in \Theta_0$, then it is as in our Definition 8.2.11, while for $\theta \in \Theta_1$ one defines it as $1 - \beta_T(\theta)$. Moreover, for $1 - \beta_T$ one finds the notations operating characteristic or OC-function.
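The power function (8.1) and the two maximal error probabilities of Example 8.2.13 are easy to evaluate numerically. The following is a minimal sketch; the function names and the values N = 20, M0 = 3, n = 5, m0 = 1 are hypothetical and serve only as an illustration.

```python
from math import comb

def hypergeom_pmf(N, M, n, m):
    """H_{N,M,n}({m}): probability of m defective machines in a sample of size n."""
    return comb(M, m) * comb(N - M, n - m) / comb(N, n)

def power(N, M, n, m0):
    """beta_T(M) = H_{N,M,n}({m0 + 1, ..., n}) as in eq. (8.1)."""
    return sum(hypergeom_pmf(N, M, n, m) for m in range(m0 + 1, n + 1))

def max_type1_error(N, M0, n, m0):
    """Maximum of beta_T(M) over 0 <= M <= M0."""
    return max(power(N, M, n, m0) for M in range(0, M0 + 1))

def max_type2_error(N, M0, n, m0):
    """Maximum of 1 - beta_T(M) over M0 < M <= N."""
    return max(1 - power(N, M, n, m0) for M in range(M0 + 1, N + 1))

# Hypothetical delivery: N = 20 machines, critical value M0 = 3,
# sample size n = 5, and rejection if more than m0 = 1 defective are found.
N, M0, n, m0 = 20, 3, 5, 1
type1 = max_type1_error(N, M0, n, m0)
type2 = max_type2_error(N, M0, n, m0)
print(type1, type2)
```

In this run both maxima are attained at the boundary between the two hypotheses, namely at M = M0 and M = M0 + 1, respectively.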


Remark 8.2.14. The previous example already illustrates the dilemma of hypothesis testing. To minimize the probability of a type I error one has to choose $m_0$ as large as possible. But increasing $m_0$ enlarges the probability of a type II error.

This dilemma always occurs in the theory of hypothesis testing. In order to minimize the probability of a type I error, the critical region $\mathcal{X}_1$ has to be chosen as small as possible. But making $\mathcal{X}_1$ smaller enlarges $\mathcal{X}_0$, hence the probability for the occurrence of a type II error increases. In the extreme case, if $\mathcal{X}_1 = \emptyset$, hence $\mathcal{X}_0 = \mathcal{X}$, then a type I error cannot occur at all. In the context of Example 8.2.6 that means the buyer accepts all deliveries and the trader takes no risk.

On the other hand, to minimize the occurrence of a type II error, the region of acceptance $\mathcal{X}_0$ has to be as small as possible. In the extreme case, if we choose $\mathcal{X}_0 = \emptyset$, then a type II error cannot occur because we always reject the hypothesis. In the context of Example 8.2.6 this says the buyer rejects all deliveries. In this way he avoids buying any delivery of bad quality, but he also never gets a proper one. Thus the buyer takes no risk.
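The dilemma described in Remark 8.2.14 becomes visible when both maximal error probabilities are tabulated for every possible choice of $m_0$. The sketch below does this for the hypergeometric test of Example 8.2.6; the function names and the values N = 20, M0 = 3, n = 5 are hypothetical.

```python
from math import comb

def power(N, M, n, m0):
    """beta_T(M) = H_{N,M,n}({m0 + 1, ..., n})."""
    return sum(comb(M, m) * comb(N - M, n - m)
               for m in range(m0 + 1, n + 1)) / comb(N, n)

def error_table(N, M0, n):
    """For each m0 return (m0, maximal type I error, maximal type II error)."""
    rows = []
    for m0 in range(0, n + 1):
        type1 = max(power(N, M, n, m0) for M in range(0, M0 + 1))
        type2 = max(1 - power(N, M, n, m0) for M in range(M0 + 1, N + 1))
        rows.append((m0, type1, type2))
    return rows

# Hypothetical values for illustration.
rows = error_table(N=20, M0=3, n=5)
for m0, type1, type2 in rows:
    print(f"m0 = {m0}: max type I = {type1:.4f}, max type II = {type2:.4f}")
```

As $m_0$ grows, the first column of errors shrinks while the second one grows; for $m_0 = n$ the critical region is empty, so a type I error is impossible and the maximal type II error equals one.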


It is clear that both extreme cases presented above are absurd. Therefore, one has to find a suitable compromise. The approach for such a compromise is as follows: in a first step one chooses tests where the probability of a type I error is bounded from above. In a second step, among all tests satisfying this bound, one takes the one that minimizes the probability of a type II error. More precisely, we will investigate tests satisfying the following condition.


Definition 8.2.15. Suppose we are given a number $\alpha \in (0,1)$, the so-called significance level. A test $T = (\mathcal{X}_0, \mathcal{X}_1)$ for testing the hypothesis $H_0 : \theta \in \Theta_0$ against $H_1 : \theta \in \Theta_1$ is said to be an α-significance test (or shorter α-test), provided the probability for the occurrence of a type I error is bounded by $\alpha$. That is, the test has to satisfy

$$\sup_{\theta \in \Theta_0} \beta_T(\theta) = \sup_{\theta \in \Theta_0} \mathbb{P}_\theta(\mathcal{X}_1) \le \alpha.$$

Interpretation: The significance level $\alpha$ is assumed to be small. Typical choices are $\alpha = 0.1$ or $\alpha = 0.01$. Let T be an α-significance test and assume that $H_0$ is true. If we now observe a sample in the critical region $\mathcal{X}_1$, then an event occurred with probability less than or equal to $\alpha$, that is, a very unlikely event has been observed. Therefore, we can be very sure that this would not have happened if $H_0$ were true, and we reject this hypothesis. If $H_0$ is in fact true, the probability of such a wrong rejection is less than or equal to $\alpha$, hence very small.

Note that α-significance tests admit no bound for the probability of a type II error. Therefore, we look for those α-significance tests that minimize the probability for a type II error.


Definition 8.2.16. Let $T_1$ and $T_2$ be two α-significance tests for checking $H_0$ against $H_1$. If their power functions satisfy

$$\beta_{T_1}(\theta) \ge \beta_{T_2}(\theta), \qquad \theta \in \Theta_1,$$

then we say that $T_1$ is (uniformly) more powerful than $T_2$. A (uniformly) most powerful α-test T is one that is more powerful than all other α-tests.

Remark 8.2.17. Note that $\beta_{T_1}(\theta) \ge \beta_{T_2}(\theta)$ implies $1 - \beta_{T_1}(\theta) \le 1 - \beta_{T_2}(\theta)$, hence if $T_1$ is more powerful than $T_2$, then, according to Remark 8.2.12, the probability for the occurrence of a type II error is smaller for $T_1$ than it is for $T_2$. Therefore, a most powerful α-test is the one that minimizes the probability of occurrence of a type II error.


Remark 8.2.18. The question about existence and uniqueness of most powerful α-tests is treated in the Neyman-Pearson lemma and its consequences. We will not discuss that problem here; instead we will construct most powerful tests in concrete situations. See [CB02], Chapter 8.3.2, for a detailed discussion of the Neyman-Pearson lemma and its consequences.


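We start with the construction of such tests in the hypergeometric case. Here we have the following.

Proposition 8.2.19. If the statistical model is $(\mathcal{X}, \mathcal{P}(\mathcal{X}), H_{N,M,n})_{M=0,\ldots,N}$ with $\mathcal{X} = \{0,\ldots,n\}$, then a most powerful α-test for testing $M \le M_0$ against $M > M_0$ is given by $T = (\mathcal{X}_0, \mathcal{X}_1)$, where $\mathcal{X}_0 = \{0,\ldots,m_0\}$, and $m_0$ is defined by

$$m_0 := \min\Big\{k \le n : \sum_{m=k+1}^{n} \frac{\binom{M_0}{m}\binom{N-M_0}{n-m}}{\binom{N}{n}} \le \alpha\Big\}.$$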
Proof: The proof of Proposition 8.2.19 needs the following lemma.

Lemma 8.2.20. The power function, defined by eq. (8.1), is a nondecreasing function on the set $\{0,\ldots,N\}$.

Proof: Suppose we get a delivery of N machines containing M defective ones. Now there are not only defective machines within the delivery, but also $\tilde{M} - M$ false ones for some $\tilde{M} \ge M$. We take a sample of size n and test these machines. Let X be the number of defective machines in the sample and let $\tilde{X}$ be the number of machines that are either defective or false. Of course, we have $X \le \tilde{X}$, implying $\mathbb{P}(X > m_0) \le \mathbb{P}(\tilde{X} > m_0)$. Note that X is $H_{N,M,n}$-distributed while $\tilde{X}$ is distributed according to $H_{N,\tilde{M},n}$. These observations lead to

$$\beta_T(M) = H_{N,M,n}(\{m_0+1,\ldots,n\}) = \mathbb{P}\{X > m_0\} \le \mathbb{P}\{\tilde{X} > m_0\} = H_{N,\tilde{M},n}(\{m_0+1,\ldots,n\}) = \beta_T(\tilde{M}).$$

This being true for all $M \le \tilde{M}$ proves that $\beta_T$ is nondecreasing. □


Let us come back to the proof of Proposition 8.2.19. Set $\mathcal{X}_0 := \{0,\ldots,m_0\}$, thus $\mathcal{X}_1 = \{m_0+1,\ldots,n\}$ for some (at the moment arbitrary) $m_0 \le n$. Because of Lemma 8.2.20, the test $T = (\mathcal{X}_0, \mathcal{X}_1)$ is an α-significance test if and only if it satisfies

$$\sum_{m=m_0+1}^{n} \frac{\binom{M_0}{m}\binom{N-M_0}{n-m}}{\binom{N}{n}} = H_{N,M_0,n}(\mathcal{X}_1) = \sup_{M \le M_0} H_{N,M,n}(\mathcal{X}_1) \le \alpha.$$

To minimize the probability for the occurrence of a type II error, we have to choose $\mathcal{X}_1$ as large as possible or, equivalently, $m_0$ as small as possible; that is, if we replace $m_0$ by $m_0 - 1$, then the new test is no longer an α-test. Thus, in order that T is an α-test that minimizes the probability for a type II error, the number $m_0$ has to be chosen such that

$$\sum_{m=m_0+1}^{n} \frac{\binom{M_0}{m}\binom{N-M_0}{n-m}}{\binom{N}{n}} \le \alpha \qquad \text{and} \qquad \sum_{m=m_0}^{n} \frac{\binom{M_0}{m}\binom{N-M_0}{n-m}}{\binom{N}{n}} > \alpha.$$

This completes the proof. □


Example 8.2.21. A buyer gets a delivery of 100 machines. In the case that there are strictly more than 10 defective machines in the delivery, he will reject it. Thus, his hypothesis is $H_0 : M \le 10$. In order to test $H_0$, he chooses 15 machines and checks them. Let m be the number of defective machines among the checked ones. For which m does he reject the delivery with significance level $\alpha = 0.01$?

Answer: We have N = 100, $M_0 = 10$, and n = 15. Since $\alpha = 0.01$, by

$$\sum_{m=5}^{15} \frac{\binom{10}{m}\binom{90}{15-m}}{\binom{100}{15}} = 0.0063\ldots < \alpha \qquad \text{and} \qquad \sum_{m=4}^{15} \frac{\binom{10}{m}\binom{90}{15-m}}{\binom{100}{15}} = 0.04\ldots > \alpha,$$

it follows that the optimal choice is $m_0 = 4$. Consequently, we have $\mathcal{X}_0 = \{0,\ldots,4\}$, thus, $\mathcal{X}_1 = \{5,\ldots,15\}$. If there are 5 or even more defective machines among the tested 15 ones, then the buyer should reject the delivery. The probability that his decision is wrong is less than or equal to 0.01.

What can be said about the probability for a type II error? For this test we have

$$\beta_T(M) = \sum_{m=5}^{15} \frac{\binom{M}{m}\binom{100-M}{15-m}}{\binom{100}{15}},$$

hence

$$1 - \beta_T(M) = \sum_{m=0}^{4} \frac{\binom{M}{m}\binom{100-M}{15-m}}{\binom{100}{15}}.$$

Since $\beta_T$ is nondecreasing, $1 - \beta_T$ is nonincreasing, and the probability for a type II error becomes maximal for M = 11. Recall that $\Theta_0 = \{0,\ldots,10\}$ and, therefore, $\Theta_1 = \{11,\ldots,100\}$. Thus, an upper bound for the probability of a type II error is given by

$$1 - \beta_T(M) \le 1 - \beta_T(11) = \sum_{m=0}^{4} \frac{\binom{11}{m}\binom{89}{15-m}}{\binom{100}{15}} = 0.989471, \qquad M = 11,\ldots,100.$$

This tells us that even in the case of most powerful tests the likelihood for a type II error may be quite large. Even if the number of defective machines is big, this error may occur with high probability. For example, we have

$$1 - \beta_T(20) = 0.853089 \qquad \text{or} \qquad 1 - \beta_T(40) = 0.197057.$$
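The numbers in Example 8.2.21 can be reproduced with a few lines of code. The sketch below searches for the smallest $m_0$ satisfying the condition of Proposition 8.2.19 and then evaluates $1 - \beta_T(M)$; the function names are hypothetical, while N = 100, $M_0 = 10$, n = 15, and $\alpha = 0.01$ are the values of the example.

```python
from math import comb

def upper_tail(N, M, n, k):
    """H_{N,M,n}({k, ..., n}); the empty sum for k > n equals 0."""
    return sum(comb(M, m) * comb(N - M, n - m) for m in range(k, n + 1)) / comb(N, n)

def critical_value(N, M0, n, alpha):
    """Smallest m0 with H_{N,M0,n}({m0 + 1, ..., n}) <= alpha."""
    for k in range(0, n + 1):
        if upper_tail(N, M0, n, k + 1) <= alpha:
            return k

def type2_probability(N, M, n, m0):
    """1 - beta_T(M) = H_{N,M,n}({0, ..., m0})."""
    return 1 - upper_tail(N, M, n, m0 + 1)

N, M0, n, alpha = 100, 10, 15, 0.01
m0 = critical_value(N, M0, n, alpha)
print(m0)  # 4
for M in (11, 20, 40):
    print(M, type2_probability(N, M, n, m0))
```

The loop in `critical_value` always terminates, because for k = n the critical region is empty and its probability is zero.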


Important Remark: An α-significance test provides us with quite precise information when rejecting the hypothesis $H_0$. In contrast, when we observe a sample $x \in \mathcal{X}_0$, then the only information we get is that we failed to reject $H_0$; thus, we must continue to regard it as true. Consequently, whenever fixing the null hypothesis, we have to fix it in a way that either a type I error has the most serious consequences or that we gain the most information by rejecting $H_0$. Let us explain this with two examples.


Example 8.2.22. A certain type of food sometimes contains a special kind of poison. Suppose there are $\mu$ milligrams of poison in one kilogram of the food. If $\mu > \mu_0$, then eating this becomes dangerous, while for $\mu \le \mu_0$ it is unproblematic. How should we choose the hypothesis when testing some sample of the food? We could take either $H_0 : \mu > \mu_0$ or $H_0 : \mu \le \mu_0$. Which is the right choice?

Answer: The correct choice is $H_0 : \mu > \mu_0$. Why? If we reject $H_0$, then we can be very sure that the food is not poisoned and may be eaten. The probability that someone will be poisoned is less than $\alpha$. A type II error occurs if the food is harmless, but we discard it because our test tells us that it is poisoned. That results in a loss for the company that produced it, but no one will suffer from poisoning. If we had chosen $H_0 : \mu \le \mu_0$, then a type II error occurs if $H_0$ is false, that is, the food is poisoned, but our test says that it is edible. Of course, this error is much more serious, and we have no control over its probability.


Example 8.2.23. Suppose the height of 18-year-old males in the US is normally distributed with expected value $\mu$ and variance $\sigma^2 > 0$. We want to know whether the average height is above or below 6 feet. There is strong evidence that we will have $\mu < 6$, but we cannot prove this. To do so, we execute a statistical experiment and choose randomly n males of age 18 and measure their height. Which hypothesis should be checked? If we take $H_0 : \mu < 6$, then it is very likely that our experiment will lead to a result that does not contradict this hypothesis, resulting in a small amount of information gained. But, if we work with the hypothesis $H_0 : \mu \ge 6$, then a rejection of this hypothesis tells us that $H_0$ is very likely wrong, and we may say the conjecture is true with high probability, namely that we have $\mu < 6$. Here the probability that our conclusion is wrong is very small.


8.3 Tests for Binomial Distributed Populations 


Because of its importance we present tests for binomial distributed populations in a separate section. The starting point is the problem described in Examples 8.1.3 and 8.1.8. In a single experiment we may observe either “0” or “1,” but we do not know the probabilities for the occurrence of these events. To obtain some information about the unknown probabilities we execute n independent trials and record how often “1” occurs. This number is $B_{n,\theta}$-distributed for some $0 \le \theta \le 1$. Hence, the describing statistical model is given by

$$(\mathcal{X}, \mathcal{P}(\mathcal{X}), B_{n,\theta})_{\theta \in [0,1]} \qquad \text{where} \quad \mathcal{X} = \{0,\ldots,n\}. \qquad (8.2)$$

Two-Sided Tests: We want to check whether the unknown parameter $\theta$ satisfies $\theta = \theta_0$ or $\theta \ne \theta_0$ for some given $\theta_0 \in [0,1]$. Thus, $\Theta_0 = \{\theta_0\}$ and $\Theta_1 = [0,1] \setminus \{\theta_0\}$. In other words, the null and the alternative hypothesis are

$$H_0 : \theta = \theta_0 \qquad \text{and} \qquad H_1 : \theta \ne \theta_0,$$

respectively.
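To construct a suitable α-significance test for checking $H_0$ we introduce two numbers $n_0$ and $n_1$ as follows. Note that these numbers are dependent on $\theta_0$ and, of course, also on $\alpha$.

$$n_0 := \min\Big\{k \le n : \sum_{j=0}^{k} \binom{n}{j} \theta_0^j (1-\theta_0)^{n-j} > \alpha/2\Big\} = \max\Big\{k \le n : \sum_{j=0}^{k-1} \binom{n}{j} \theta_0^j (1-\theta_0)^{n-j} \le \alpha/2\Big\} \qquad (8.3)$$

and

$$n_1 := \max\Big\{k \le n : \sum_{j=k}^{n} \binom{n}{j} \theta_0^j (1-\theta_0)^{n-j} > \alpha/2\Big\} = \min\Big\{k \le n : \sum_{j=k+1}^{n} \binom{n}{j} \theta_0^j (1-\theta_0)^{n-j} \le \alpha/2\Big\}. \qquad (8.4)$$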


Proposition 8.3.1. Regard the statistical model (8.2) and let $0 < \alpha < 1$ be a significance level. The hypothesis test $T = (\mathcal{X}_0, \mathcal{X}_1)$ with

$$\mathcal{X}_0 := \{n_0, n_0+1, \ldots, n_1-1, n_1\} \qquad \text{and} \qquad \mathcal{X}_1 = \{0,\ldots,n_0-1\} \cup \{n_1+1,\ldots,n\}$$

is an α-significance test to check $H_0 : \theta = \theta_0$ against $H_1 : \theta \ne \theta_0$. Here $n_0$ and $n_1$ are defined as in eqs. (8.3) and (8.4).

Proof: Since $\Theta_0$ consists only of the point $\theta_0$, an arbitrary test $T = (\mathcal{X}_0, \mathcal{X}_1)$ is an α-significance test if and only if $B_{n,\theta_0}(\mathcal{X}_1) \le \alpha$. Now let T be as in the formulation of the proposition. By the definition of the numbers $n_0$ and $n_1$ we obtain

$$B_{n,\theta_0}(\mathcal{X}_1) = \sum_{j=0}^{n_0-1} \binom{n}{j} \theta_0^j (1-\theta_0)^{n-j} + \sum_{j=n_1+1}^{n} \binom{n}{j} \theta_0^j (1-\theta_0)^{n-j} \le \frac{\alpha}{2} + \frac{\alpha}{2} = \alpha,$$

that is, as claimed, the test $T := (\mathcal{X}_0, \mathcal{X}_1)$ is an α-significance test. Note that the region $\mathcal{X}_1$ is chosen maximal. In fact, neither can $n_0$ be enlarged nor can $n_1$ be made smaller. □


Remark 8.3.2. In this test the critical region $\mathcal{X}_1$ consists of two parts or tails. Therefore, this type of hypothesis test is called a two-sided test.


Example 8.3.3. An urn contains an unknown number of white and black balls. Let $\theta \in [0,1]$ be the proportion of white balls. We conjecture that there are as many white as black balls in the urn. That is, the null hypothesis is $H_0 : \theta = 0.5$. To test this hypothesis we choose one after the other 100 balls with replacement. In order to determine $n_0$ and $n_1$ in this situation let $\varphi$ be defined as

$$\varphi(k) := \sum_{j=0}^{k-1} \binom{100}{j} \Big(\frac{1}{2}\Big)^{100}.$$

Numerical calculations give

φ(37) = 0.00331856,  φ(38) = 0.00601649,  φ(39) = 0.0104894,
φ(40) = 0.0176001,   φ(41) = 0.028444,    φ(42) = 0.044313,
φ(43) = 0.0666053,   φ(44) = 0.096674,    φ(45) = 0.135627,
φ(46) = 0.184101,    φ(47) = 0.242059,    φ(48) = 0.30865,
φ(49) = 0.382177,    φ(50) = 0.460205.

If the significance level is chosen as $\alpha = 0.1$ we see that $\varphi(42) \le 0.05$, but $\varphi(43) > 0.05$. Hence, by the definition of $n_0$ in eq. (8.3), it follows that $n_0 = 42$. By symmetry, for $n_1$ defined in eq. (8.4), we get $n_1 = 58$. Consequently, the regions of acceptance and rejection are given by

$$\mathcal{X}_0 = \{42, 43, \ldots, 57, 58\} \qquad \text{and} \qquad \mathcal{X}_1 = \{0,\ldots,41\} \cup \{59,\ldots,100\}.$$

For example, if we observe during 100 trials k white balls for some k < 42 or some k > 58, then we may be quite sure that our null hypothesis is wrong, that is, the numbers of white and black balls are significantly different. This assertion is made with a security of 90%.

If we want to be more confident in the conclusion, we have to choose a smaller significance level. For example, if we take $\alpha = 0.01$, the values of $\varphi$ imply $n_0 = 37$ and $n_1 = 63$, hence

$$\mathcal{X}_0 = \{37, 38, \ldots, 62, 63\} \qquad \text{and} \qquad \mathcal{X}_1 = \{0,\ldots,36\} \cup \{64,\ldots,100\}.$$

Again we see that a smaller bound for the probability of a type I error leads to an enlargement of $\mathcal{X}_0$, thus, to an increase of the chance for a type II error.
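The numbers $n_0$ and $n_1$ of Example 8.3.3 can be computed directly from eqs. (8.3) and (8.4). The following is a minimal sketch; the function names are hypothetical, while n = 100, $\theta_0 = 0.5$, and the two significance levels are those of the example.

```python
from math import comb

def binom_pmf(n, theta, j):
    """B_{n,theta}({j})."""
    return comb(n, j) * theta ** j * (1 - theta) ** (n - j)

def two_sided_bounds(n, theta0, alpha):
    """Return (n0, n1) following the min form of (8.3) and the max form of (8.4)."""
    n0, lower = n, 0.0
    for k in range(n + 1):
        lower += binom_pmf(n, theta0, k)      # B_{n,theta0}({0, ..., k})
        if lower > alpha / 2:
            n0 = k
            break
    n1, upper = 0, 0.0
    for k in range(n, -1, -1):
        upper += binom_pmf(n, theta0, k)      # B_{n,theta0}({k, ..., n})
        if upper > alpha / 2:
            n1 = k
            break
    return n0, n1

def rejection_probability(n, theta, n0, n1):
    """B_{n,theta} of the critical region {0,...,n0-1} united with {n1+1,...,n}."""
    return sum(binom_pmf(n, theta, j) for j in range(n + 1) if j < n0 or j > n1)

print(two_sided_bounds(100, 0.5, 0.1))   # (42, 58)
print(two_sided_bounds(100, 0.5, 0.01))  # (37, 63)
print(rejection_probability(100, 0.5, 42, 58))
```

The last line evaluates the probability of a type I error for the 0.1-test; it stays below the significance level, as Proposition 8.3.1 asserts.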


One-Sided Tests: Now the null hypothesis is $H_0 : \theta \le \theta_0$ for some $\theta_0 \in [0,1]$. In the context of Example 8.1.3 we claim that the proportion of white balls in the urn does not exceed $\theta_0$. For instance, if $\theta_0 = 1/2$, then we want to test whether or not the number of white balls is less than or equal to that of black ones.

Before we present a most powerful test for this situation let us define a number $n_0$ depending on $\theta_0$ and on the significance level $0 < \alpha < 1$.

$$n_0 := \max\Big\{k \le n : \sum_{j=k}^{n} \binom{n}{j} \theta_0^j (1-\theta_0)^{n-j} > \alpha\Big\} = \min\Big\{k \le n : \sum_{j=k+1}^{n} \binom{n}{j} \theta_0^j (1-\theta_0)^{n-j} \le \alpha\Big\}. \qquad (8.5)$$

Now we are in a position to state the most powerful one-sided a-test for a binomial 
distributed population. 


Proposition 8.3.4. Suppose $\mathcal{X} = \{0,\ldots,n\}$, and let $(\mathcal{X}, \mathcal{P}(\mathcal{X}), B_{n,\theta})_{\theta \in [0,1]}$ be the statistical model describing a binomial distributed population. Given $0 < \alpha < 1$, define $n_0$ by eq. (8.5) and set $\mathcal{X}_0 = \{0,\ldots,n_0\}$, hence $\mathcal{X}_1 = \{n_0+1,\ldots,n\}$. Then $T = (\mathcal{X}_0, \mathcal{X}_1)$ is the most powerful α-test to check the null hypothesis $H_0 : \theta \le \theta_0$ against $H_1 : \theta > \theta_0$.


Proof: With an arbitrary $0 \le n' \le n$ define the region of acceptance $\mathcal{X}_0$ of a test T by $\mathcal{X}_0 = \{0,\ldots,n'\}$. Then its power function is given by

$$\beta_T(\theta) = B_{n,\theta}(\mathcal{X}_1) = \sum_{j=n'+1}^{n} \binom{n}{j} \theta^j (1-\theta)^{n-j}, \qquad 0 \le \theta \le 1. \qquad (8.6)$$

To proceed further we need the following lemma.

Lemma 8.3.5. The power function (8.6) is nondecreasing in $[0,1]$.

Proof: Suppose in an urn there are white, red, and black balls. Their proportions are $\theta_1$, $\theta_2 - \theta_1$, and $1 - \theta_2$ for some $0 \le \theta_1 \le \theta_2 \le 1$. Choose n balls with replacement. Let X be the number of chosen white balls, and let Y be the number of balls that were either white or red. Then X is $B_{n,\theta_1}$-distributed, while Y is distributed according to $B_{n,\theta_2}$. Moreover, $X \le Y$, hence it follows that $\mathbb{P}(X > n') \le \mathbb{P}(Y > n')$, which leads to

$$\beta_T(\theta_1) = B_{n,\theta_1}(\{n'+1,\ldots,n\}) = \mathbb{P}(X > n') \le \mathbb{P}(Y > n') = B_{n,\theta_2}(\{n'+1,\ldots,n\}) = \beta_T(\theta_2).$$

This being true for all $\theta_1 \le \theta_2$ completes the proof. □
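The coupling argument in the proof of Lemma 8.3.5 can be imitated by a small simulation. In the sketch below each drawn ball is represented by a uniform random number u: the ball counts as white if u < θ1 and as white or red if u < θ2. The function name, the seed, and the values of n, θ1, θ2 are hypothetical.

```python
import random

def coupled_counts(n, theta1, theta2, rng):
    """Draw n balls with replacement and return (X, Y), where X counts the
    white balls and Y counts the balls that are white or red."""
    x = y = 0
    for _ in range(n):
        u = rng.random()
        if u < theta1:   # white ball
            x += 1
        if u < theta2:   # white or red ball
            y += 1
    return x, y

rng = random.Random(2024)  # fixed seed so that the run is reproducible
n, theta1, theta2 = 20, 0.3, 0.5

samples = [coupled_counts(n, theta1, theta2, rng) for _ in range(1000)]
print(all(x <= y for x, y in samples))  # True, since theta1 <= theta2
```

Because every white ball is also counted in Y, the inequality X ≤ Y holds for every single draw, not only on average; this is exactly what yields $\mathbb{P}(X > n') \le \mathbb{P}(Y > n')$.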


An application of Lemma 8.3.5 implies that the above test T is an α-significance test if and only if

$$\sum_{j=n'+1}^{n} \binom{n}{j} \theta_0^j (1-\theta_0)^{n-j} = \beta_T(\theta_0) = \sup_{\theta \le \theta_0} \beta_T(\theta) \le \alpha.$$

In order to minimize the probability of a type II error, we have to choose $\mathcal{X}_0$ as small as possible. That is, if we replace $n'$ by $n' - 1$, the modified test is no longer an α-test. Thus, the optimal choice is $n' = n_0$ where $n_0$ is defined by eq. (8.5). This completes the proof of Proposition 8.3.4. □


Example 8.3.6. Let us come back to the problem investigated in Example 8.1.3. Our 
null hypothesis is Ho : 6 < 1/2, that is, we claim that at most half of the balls are white. 
To test Hy, we choose 100 balls and record their color. Let k be the number of observed 
white balls. For which k must we reject Ho with a security of 90%? 

Answer: Since

\[
\sum_{k=56}^{100} \binom{100}{k} \Bigl(\frac{1}{2}\Bigr)^{100} = 0.135627 \quad\text{and}\quad \sum_{k=57}^{100} \binom{100}{k} \Bigl(\frac{1}{2}\Bigr)^{100} = 0.096674,
\]

for $\alpha = 0.1$ the number $n_0$ in eq. (8.5) equals $n_0 = 56$. Consequently, the region of
acceptance for the best 0.1-test is given by $\mathcal{X}_0 = \{0, \ldots, 56\}$. Thus, whenever there are
57 or more white balls among the chosen 100, the hypothesis has to be rejected. The
probability of a wrong decision is less than or equal to 0.1.
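
The search for the smallest admissible bound $n'$ can also be carried out numerically. The following is a minimal sketch, assuming only the Python standard library; the function names `binomial_tail` and `smallest_acceptance_bound` are chosen here for illustration and do not appear in the text.

```python
from math import comb

def binomial_tail(n: int, theta: float, k: int) -> float:
    """Return P{X >= k} for a B_{n,theta}-distributed random variable X."""
    return sum(comb(n, j) * theta**j * (1 - theta)**(n - j) for j in range(k, n + 1))

def smallest_acceptance_bound(n: int, theta0: float, alpha: float) -> int:
    """Smallest n' for which the test with acceptance region {0,...,n'} is an alpha-test."""
    for n_prime in range(n + 1):
        # The test is an alpha-test iff the tail sum starting at n' + 1 is at most alpha.
        if binomial_tail(n, theta0, n_prime + 1) <= alpha:
            return n_prime
    return n

print(round(binomial_tail(100, 0.5, 56), 6))      # tail sum starting at k = 56
print(round(binomial_tail(100, 0.5, 57), 6))      # tail sum starting at k = 57
print(smallest_acceptance_bound(100, 0.5, 0.1))   # 56
```

The loop uses the monotonicity from Lemma 8.3.5 only implicitly: it suffices to evaluate the tail sum at the boundary value $\theta_0$.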

Making the significance level smaller, for example, taking $\alpha = 0.01$, leads to
$n_0 = 62$. Hence, if the number of white balls is 63 or larger, a rejection of $H_0$ is
99% sure.


Remark 8.3.7. Example 8.3.6 emphasizes once more the dilemma of hypothesis testing.
The price one pays for higher security when rejecting $H_0$ is an increase of the
likelihood of a type II error. For instance, replacing $\alpha = 0.1$ by $\alpha = 0.01$ in the previous
example leads to an enlargement of $\mathcal{X}_0$ from $\{0, \ldots, 56\}$ to $\{0, \ldots, 62\}$. Thus, if we
observe 60 white balls, we reject $H_0$ in the former case, but we cannot reject it in the
latter one. This once more stresses the fact that an observation of an $x \in \mathcal{X}_0$ does not
guarantee that $H_0$ is true. It only means that the observed sample does not allow us to
reject the hypothesis.


8.4 Tests for Normally Distributed Populations 


Throughout this section we always assume $\mathcal{X} = \mathbb{R}^n$. That is, our samples are vectors
$x = (x_1, \ldots, x_n)$ with $x_j \in \mathbb{R}$. Given a sample $x \in \mathbb{R}^n$, we derive from it the following
quantities that will soon play a crucial role.


Definition 8.4.1. If $x = (x_1, \ldots, x_n) \in \mathbb{R}^n$, then we set

\[
\bar{x} := \frac{1}{n} \sum_{j=1}^{n} x_j, \qquad s_x^2 := \frac{1}{n-1} \sum_{j=1}^{n} (x_j - \bar{x})^2 \qquad\text{and}\qquad \sigma_x^2 := \frac{1}{n} \sum_{j=1}^{n} (x_j - \bar{x})^2. \tag{8.7}
\]

The number $\bar{x}$ is said to be the sample mean of $x$, while $s_x^2$ and $\sigma_x^2$ are said to
be the unbiased sample variance and the (biased) sample variance of the
vector $x$, respectively.


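Analogously, if $X = (X_1, \ldots, X_n)$ is an $n$-dimensional random vector,³ then we define
the corresponding expressions pointwise. For instance, we have

\[
\bar{X}(\omega) := \frac{1}{n} \sum_{j=1}^{n} X_j(\omega) \qquad\text{and}\qquad s_X^2(\omega) := \frac{1}{n-1} \sum_{j=1}^{n} \bigl(X_j(\omega) - \bar{X}(\omega)\bigr)^2.
\]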
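
To make the three sample quantities concrete, here is a small sketch in Python. The sample `x` is hypothetical and chosen only for illustration; the helper name `sample_quantities` is likewise an assumption of this sketch. The comparison at the end relies on `statistics.variance` (divisor $n-1$) and `statistics.pvariance` (divisor $n$) from the standard library.

```python
import statistics

def sample_quantities(x):
    """Return (sample mean, unbiased sample variance, biased sample variance)."""
    n = len(x)
    mean = sum(x) / n
    squared_deviations = sum((xj - mean) ** 2 for xj in x)
    return mean, squared_deviations / (n - 1), squared_deviations / n

x = [2.0, 4.0, 4.0, 6.0]          # hypothetical sample, not taken from the text
mean, s2, sigma2 = sample_quantities(x)
print(mean, s2, sigma2)

# The standard library provides the same two variances.
print(statistics.variance(x))     # unbiased version, divisor n - 1
print(statistics.pvariance(x))    # biased version, divisor n
```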


8.4.1 Fisher’s Theorem 

We are going to prove important properties of normally distributed populations. They 
turn out to be the basis for all hypothesis tests in the normally distributed case. The 
starting point is a crucial lemma going back to Ronald Aylmer Fisher. 

Lemma 8.4.2 (Fisher’s lemma). Let $Y_1, \ldots, Y_n$ be independent $\mathcal{N}(0,1)$-distributed random
variables and let $B = (\beta_{ij})_{i,j=1}^{n}$ be a unitary $n \times n$ matrix. The random variables
$Z_1, \ldots, Z_n$ are defined as

\[
Z_i := \sum_{j=1}^{n} \beta_{ij} Y_j, \qquad 1 \le i \le n.
\]

They possess the following properties.

(i) The variables $Z_1, \ldots, Z_n$ are also independent and $\mathcal{N}(0,1)$-distributed.
(ii) For $m < n$ let the quadratic form $Q$ on $\mathbb{R}^n$ be defined by

\[
Q := \sum_{j=1}^{n} Y_j^2 - \sum_{i=1}^{m} Z_i^2.
\]

Then $Q$ is independent of all $Z_1, \ldots, Z_m$ and distributed according to $\chi^2_{n-m}$.


Proof: Assertion (i) was already proven in Proposition 6.1.16.

Let us verify (ii). The matrix $B$ is unitary, thus it preserves the length of vectors in
$\mathbb{R}^n$. Applying this to $Y = (Y_1, \ldots, Y_n)$ and $Z = BY$ gives

\[
\sum_{i=1}^{n} Z_i^2 = |Z|_2^2 = |BY|_2^2 = |Y|_2^2 = \sum_{j=1}^{n} Y_j^2,
\]

which leads to

\[
Q = Z_{m+1}^2 + \cdots + Z_n^2. \tag{8.8}
\]

By virtue of (i) the random variables $Z_1, \ldots, Z_n$ are independent, hence by eq. (8.8)
and Remark 4.1.10 the quadratic form $Q$ is independent of $Z_1, \ldots, Z_m$.

Recall that $Z_{m+1}, \ldots, Z_n$ are independent $\mathcal{N}(0,1)$-distributed. Thus, in view of
eq. (8.8), Proposition 4.6.17 implies that $Q$ is $\chi^2_{n-m}$-distributed. Observe that $Q$ is the
sum of $n - m$ squares. □

³ To simplify the notation, now and later on, we denote random vectors by $X$, not by $\vec{X}$ as we did
before. This should not lead to confusion.


Now we are in a position to state and prove one of the most important results in 
Mathematical Statistics. 


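Proposition 8.4.3 (Fisher’s theorem). Suppose $X_1, \ldots, X_n$ are independent and distributed
according to $\mathcal{N}(\mu, \sigma^2)$ for some $\mu \in \mathbb{R}$ and some $\sigma^2 > 0$. Then the following are
valid:

\[
\sqrt{n}\, \frac{\bar{X} - \mu}{\sigma} \quad\text{is } \mathcal{N}(0,1)\text{-distributed}, \tag{8.9}
\]
\[
(n-1)\, \frac{s_X^2}{\sigma^2} \quad\text{is } \chi^2_{n-1}\text{-distributed}, \tag{8.10}
\]
\[
\sqrt{n}\, \frac{\bar{X} - \mu}{s_X} \quad\text{is } t_{n-1}\text{-distributed}, \tag{8.11}
\]

where $s_X := +\sqrt{s_X^2}$. Furthermore, $\bar{X}$ and $s_X^2$ are independent random variables.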


Proof: Let us begin with the proof of assertion (8.9). Since the $X_j$s are independent
and $\mathcal{N}(\mu, \sigma^2)$-distributed, by Proposition 4.6.18 their sum $X_1 + \cdots + X_n$ possesses an
$\mathcal{N}(n\mu, n\sigma^2)$ distribution. Consequently, an application of Proposition 4.2.3 implies that
$\bar{X}$ is $\mathcal{N}(\mu, \sigma^2/n)$-distributed, hence, another application of Proposition 4.2.3 tells us
that $\frac{\bar{X} - \mu}{\sigma/\sqrt{n}}$ is standard normal. This completes the proof of statement (8.9).

We turn now to the verification of the remaining assertions. Letting

\[
Y_j := \frac{X_j - \mu}{\sigma}, \qquad 1 \le j \le n, \tag{8.12}
\]

the random variables $Y_1, \ldots, Y_n$ are independent $\mathcal{N}(0,1)$-distributed. Moreover, their
(unbiased) sample variance may be calculated by

\[
\begin{aligned}
s_Y^2 &= \frac{1}{n-1} \sum_{j=1}^{n} (Y_j - \bar{Y})^2
= \frac{1}{n-1} \Bigl\{ \sum_{j=1}^{n} Y_j^2 - 2\bar{Y} \sum_{j=1}^{n} Y_j + n\bar{Y}^2 \Bigr\} \\
&= \frac{1}{n-1} \Bigl\{ \sum_{j=1}^{n} Y_j^2 - 2n\bar{Y}^2 + n\bar{Y}^2 \Bigr\}
= \frac{1}{n-1} \Bigl\{ \sum_{j=1}^{n} Y_j^2 - (\sqrt{n}\, \bar{Y})^2 \Bigr\}.
\end{aligned} \tag{8.13}
\]


To proceed further, set $b_1 := (n^{-1/2}, \ldots, n^{-1/2})$, and note that $b_1$ is a normalized
$n$-dimensional vector, that is, we have $|b_1|_2 = 1$. Let $E \subseteq \mathbb{R}^n$ be the $(n-1)$-dimensional
subspace consisting of elements that are perpendicular to $b_1$. Choose an orthonormal
basis $b_2, \ldots, b_n$ in $E$. Then, by the choice of $E$, the vectors $b_1, \ldots, b_n$ form an
orthonormal basis in $\mathbb{R}^n$. If $b_i = (\beta_{i1}, \ldots, \beta_{in})$, $1 \le i \le n$, let $B$ be the $n \times n$-matrix with
entries $\beta_{ij}$, that is, the vectors $b_1, \ldots, b_n$ are the rows of $B$. Since $(b_i)_{i=1}^{n}$ are orthonormal,
$B$ is unitary.

As in Lemma 8.4.2, define $Z_1, \ldots, Z_n$ by

\[
Z_i := \sum_{j=1}^{n} \beta_{ij} Y_j, \qquad 1 \le i \le n,
\]

and the quadratic form $Q$ (with $m = 1$) as

\[
Q := \sum_{j=1}^{n} Y_j^2 - Z_1^2.
\]

Because of Lemma 8.4.2, the quadratic form $Q$ is $\chi^2_{n-1}$-distributed and, furthermore, it
is independent of $Z_1$. By the choice of $B$ and of $b_1$,

\[
\beta_{11} = \cdots = \beta_{1n} = n^{-1/2},
\]

hence $Z_1 = n^{1/2}\, \bar{Y}$, and by eq. (8.13) this leads to

\[
Q = \sum_{j=1}^{n} Y_j^2 - (n^{1/2}\, \bar{Y})^2 = (n-1)\, s_Y^2.
\]


This observation implies that $(n-1)\, s_Y^2$ is $\chi^2_{n-1}$-distributed and, moreover, that $(n-1)\, s_Y^2$ and $Z_1$
are independent, thus also $s_Y^2$ and $Z_1$.

The choice of the $Y_j$s in eq. (8.12) immediately implies $\bar{Y} = \frac{\bar{X} - \mu}{\sigma}$, hence

\[
(n-1)\, s_Y^2 = \sum_{j=1}^{n} (Y_j - \bar{Y})^2 = \sum_{j=1}^{n} \Bigl( \frac{X_j - \mu}{\sigma} - \frac{\bar{X} - \mu}{\sigma} \Bigr)^2 = (n-1)\, \frac{s_X^2}{\sigma^2},
\]

which proves assertion (8.10).

Recall that $Z_1 = n^{1/2}\, \bar{Y} = n^{1/2}\, \frac{\bar{X} - \mu}{\sigma}$, which leads to $\bar{X} = n^{-1/2} \sigma Z_1 + \mu$. Thus, because
of Proposition 4.1.9, the independence of $s_Y^2 = s_X^2/\sigma^2$ and $Z_1$ implies that $s_X^2$ and $\bar{X}$ are
independent as well.

It remains to prove statement (8.11). We already know that $V := \sqrt{n}\, \frac{\bar{X} - \mu}{\sigma}$ is standard
normal, and $W := (n-1)\, s_X^2/\sigma^2$ is $\chi^2_{n-1}$-distributed. Since they are independent, by
Proposition 4.7.8, applied with $n - 1$, we get that

\[
\sqrt{n}\, \frac{\bar{X} - \mu}{s_X} = \frac{V}{\sqrt{\frac{1}{n-1}\, W}} \quad\text{is } t_{n-1}\text{-distributed.}
\]

This implies assertion (8.11) and completes the proof of the proposition. □
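
The key identity of the proof, $Q = \sum_j Y_j^2 - Z_1^2 = (n-1)\, s_X^2/\sigma^2$ with $Z_1 = n^{1/2}\, \bar{Y}$, can be checked numerically, and a simulation gives a rough plausibility check of the theorem. The sketch below uses only the Python standard library; the parameter values, the seed and the helper names are assumptions made for illustration. Note that it only estimates the correlation between $\bar{X}$ and $s_X^2$, and a correlation near zero is a necessary consequence of independence, not a proof of it.

```python
import random

def mean(values):
    return sum(values) / len(values)

def unbiased_variance(values):
    m = mean(values)
    return sum((v - m) ** 2 for v in values) / (len(values) - 1)

def correlation(u, v):
    mu_u, mu_v = mean(u), mean(v)
    cov = sum((a - mu_u) * (b - mu_v) for a, b in zip(u, v))
    norm = (sum((a - mu_u) ** 2 for a in u) * sum((b - mu_v) ** 2 for b in v)) ** 0.5
    return cov / norm

random.seed(0)                                  # fixed seed, so that runs are reproducible
mu, sigma, n, runs = 5.0, 2.0, 6, 20000         # illustrative values, not from the text

sample_means, scaled_variances = [], []
for _ in range(runs):
    x = [random.gauss(mu, sigma) for _ in range(n)]
    y = [(xj - mu) / sigma for xj in x]         # normalization as in eq. (8.12)
    z1 = n ** 0.5 * mean(y)                     # Z_1 = n^{1/2} * mean of the Y_j
    q = sum(yj ** 2 for yj in y) - z1 ** 2      # quadratic form Q with m = 1
    w = (n - 1) * unbiased_variance(x) / sigma ** 2
    assert abs(q - w) < 1e-9                    # Q = (n - 1) s_X^2 / sigma^2
    sample_means.append(mean(x))
    scaled_variances.append(w)

# A chi^2_{n-1}-distributed variable has expected value n - 1 and variance 2(n - 1).
print(mean(scaled_variances), unbiased_variance(scaled_variances))
# Independence of sample mean and sample variance implies correlation 0.
print(correlation(sample_means, scaled_variances))
```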


Remark 8.4.4. It is important to mention that the random variables $X_1, \ldots, X_n$ satisfy
the assumptions of Proposition 8.4.3 if and only if the vector $(X_1, \ldots, X_n)$ is $\mathcal{N}(\mu, \sigma^2)^{\otimes n}$-distributed
or, equivalently, if its probability distribution is $\mathcal{N}(\vec{\mu}, \sigma^2 I_n)$ with $\vec{\mu} = (\mu, \ldots, \mu)$.


8.4.2 Quantiles 


Quantiles may be defined in a quite general way. However, we will restrict ourselves 
to those quantiles that will be used later on. The first quantiles we consider are those 
of the standard normal distribution. 


Definition 8.4.5. Let $\Phi$ be the distribution function of $\mathcal{N}(0,1)$, as it was introduced
in Definition 1.62. For a given $\beta \in (0,1)$, the $\beta$-quantile $z_\beta$ of the standard
normal distribution is the unique real number satisfying

\[
\Phi(z_\beta) = \beta \quad\text{or, equivalently,}\quad z_\beta = \Phi^{-1}(\beta).
\]

Another way to define $z_\beta$ is as follows. Let $X$ be a standard normal random variable.
Then $z_\beta$ is the unique real number such that

\[
\mathbb{P}\{X \le z_\beta\} = \beta.
\]


The following properties of $z_\beta$ will be used later on.

Proposition 8.4.6. Let $X$ be standard normally distributed. Then the following are valid.
1. We have $z_\beta \le 0$ for $\beta \le 1/2$ and $z_\beta > 0$ for $\beta > 1/2$.

2. $\mathbb{P}\{X \ge z_\beta\} = 1 - \beta$.

3. If $0 < \beta < 1$, then $z_{1-\beta} = -z_\beta$.

4. For $0 < \alpha < 1$ we have $\mathbb{P}\{|X| \ge z_{1-\alpha/2}\} = \alpha$.


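Proof: The first property easily follows from $\Phi(0) = 1/2$, hence $\Phi(t) > 1/2$ if and only if
$t > 0$.

Let $X$ be standard normal. Then $\mathbb{P}\{X \ge z_\beta\} = 1 - \mathbb{P}\{X \le z_\beta\} = 1 - \beta$, which proves the
second assertion.

Since $-X$ is standard normal as well, by 2. it follows that

\[
\mathbb{P}\{X \le -z_\beta\} = \mathbb{P}\{X \ge z_\beta\} = 1 - \beta = \mathbb{P}\{X \le z_{1-\beta}\},
\]

hence $z_{1-\beta} = -z_\beta$ as asserted.

To prove the fourth assertion note that properties 2 and 3 imply

\[
\begin{aligned}
\mathbb{P}\{|X| \ge z_{1-\alpha/2}\} &= \mathbb{P}\{X \le -z_{1-\alpha/2} \text{ or } X \ge z_{1-\alpha/2}\} \\
&= \mathbb{P}\{X \le -z_{1-\alpha/2}\} + \mathbb{P}\{X \ge z_{1-\alpha/2}\} \\
&= \mathbb{P}\{X \le z_{\alpha/2}\} + \mathbb{P}\{X \ge z_{1-\alpha/2}\} = \alpha/2 + \alpha/2 = \alpha.
\end{aligned}
\]

Here we used $1 - \alpha/2 > 1/2$ implying $z_{1-\alpha/2} > 0$, hence the events $\{X \le -z_{1-\alpha/2}\}$ and
$\{X \ge z_{1-\alpha/2}\}$ are disjoint. □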
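
The quantiles $z_\beta$ and properties 3 and 4 can be checked numerically. The sketch below assumes Python 3.8 or later, where the standard library class `statistics.NormalDist` provides `cdf` and `inv_cdf`; the helper name `z` is chosen here for illustration.

```python
from statistics import NormalDist

std_normal = NormalDist(mu=0.0, sigma=1.0)

def z(beta: float) -> float:
    """beta-quantile of the standard normal distribution, that is, Phi^{-1}(beta)."""
    return std_normal.inv_cdf(beta)

alpha = 0.05
print(z(1 - alpha), z(1 - alpha / 2))    # approximately 1.645 and 1.960

# Property 3: z_{1-beta} = -z_beta
print(z(0.9) + z(0.1))                   # numerically zero

# Property 4: P{|X| >= z_{1-alpha/2}} = alpha
critical = z(1 - alpha / 2)
two_sided_tail = std_normal.cdf(-critical) + (1 - std_normal.cdf(critical))
print(two_sided_tail)                    # numerically equal to alpha
```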


The next quantile, needed later on, is that of a $\chi^2_n$ distribution.


Definition 8.4.7. Let $X$ be distributed according to $\chi^2_n$ and let $0 < \beta < 1$. The unique
(positive) number $\chi^2_{n;\beta}$ satisfying

\[
\mathbb{P}\{X \le \chi^2_{n;\beta}\} = \beta
\]

is called $\beta$-quantile of the $\chi^2_n$-distribution.


Two other, equivalent, ways to introduce these quantiles are as follows.
1. If $X_1, \ldots, X_n$ are independent standard normal, then

\[
\mathbb{P}\{X_1^2 + \cdots + X_n^2 \le \chi^2_{n;\beta}\} = \beta.
\]

2. The quantile $\chi^2_{n;\beta}$ satisfies

\[
\frac{1}{2^{n/2}\, \Gamma(n/2)} \int_0^{\chi^2_{n;\beta}} x^{n/2-1} e^{-x/2}\, dx = \beta.
\]

For later purposes we mention also the following property. If $0 < \alpha < 1$, then for any
$\chi^2_n$-distributed random variable $X$,

\[
\mathbb{P}\bigl\{X \notin [\chi^2_{n;\alpha/2}, \chi^2_{n;1-\alpha/2}]\bigr\} = \alpha.
\]
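
The integral characterization in item 2 suggests a direct, if naive, way to approximate $\chi^2_{n;\beta}$: integrate the density numerically and solve for the upper limit by bisection. The following is a rough sketch under that approach, using only the Python standard library. It is restricted to $n \ge 2$, because for $n = 1$ the density is unbounded at $0$; the function names and the step count are assumptions of this sketch. For $n = 2$ the result can be compared with the closed form $\chi^2_{2;\beta} = -2 \ln(1-\beta)$.

```python
import math

def chi2_density(x: float, n: int) -> float:
    """Density of the chi^2_n distribution at x >= 0 (sketch assumes n >= 2)."""
    return x ** (n / 2 - 1) * math.exp(-x / 2) / (2 ** (n / 2) * math.gamma(n / 2))

def chi2_cdf(t: float, n: int, steps: int = 2000) -> float:
    """P{X <= t} for a chi^2_n-distributed X, by composite Simpson integration."""
    if t <= 0:
        return 0.0
    h = t / steps                                  # steps has to be even
    total = chi2_density(0.0, n) + chi2_density(t, n)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * chi2_density(i * h, n)
    return total * h / 3

def chi2_quantile(n: int, beta: float) -> float:
    """Approximate the beta-quantile of the chi^2_n distribution by bisection."""
    lo, hi = 0.0, 1.0
    while chi2_cdf(hi, n) < beta:                  # enlarge the bracket until it contains the quantile
        hi *= 2.0
    for _ in range(60):
        mid = (lo + hi) / 2
        if chi2_cdf(mid, n) < beta:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

print(chi2_quantile(2, 0.95), -2 * math.log(0.05))   # both close to 5.9915
print(chi2_quantile(4, 0.95))
```

In practice one would rather take such quantiles from a table or a statistics library; the sketch only illustrates that the definition determines them.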


In a similar way we now define the quantiles of Student’s $t_n$ and of Fisher’s
$F_{m,n}$ distributions. For their descriptions we refer to Definitions 4.7.6 and 4.7.13,
respectively.


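Definition 8.4.8. Let $X$ be $t_n$-distributed and let $Y$ be distributed according to $F_{m,n}$.
For $\beta \in (0,1)$ the $\beta$-quantiles $t_{n;\beta}$ and $F_{m,n;\beta}$ of the $t_n$- and $F_{m,n}$-distributions are the
unique numbers satisfying

\[
\mathbb{P}\{X \le t_{n;\beta}\} = \beta \quad\text{and}\quad \mathbb{P}\{Y \le F_{m,n;\beta}\} = \beta.
\]


Remark 8.4.9. Let $X$ be $t_n$-distributed. Then $-X$ is $t_n$-distributed as well, hence
$\mathbb{P}\{X \le s\} = \mathbb{P}\{-X \le s\}$ for $s \in \mathbb{R}$. Therefore, as in the case of the normal distribution,
we get $-t_{n;\beta} = t_{n;1-\beta}$, and also

\[
\mathbb{P}\{|X| > t_{n;1-\alpha/2}\} = \mathbb{P}\{|X| \ge t_{n;1-\alpha/2}\} = \alpha. \tag{8.14}
\]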


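Remark 8.4.10. Another way to introduce the quantiles of the $F_{m,n}$-distribution is
as follows. Let $X$ and $Y$ be independent and distributed according to $\chi^2_m$ and $\chi^2_n$,
respectively. The quantile $F_{m,n;\beta}$ is the unique number satisfying

\[
\mathbb{P}\Bigl\{ \frac{X/m}{Y/n} \le F_{m,n;\beta} \Bigr\} = \beta.
\]

If $s > 0$, then

\[
\mathbb{P}\Bigl\{ \frac{X/m}{Y/n} \le s \Bigr\} = \mathbb{P}\Bigl\{ \frac{Y/n}{X/m} \ge \frac{1}{s} \Bigr\} = 1 - \mathbb{P}\Bigl\{ \frac{Y/n}{X/m} \le \frac{1}{s} \Bigr\},
\]

which immediately implies

\[
F_{m,n;\beta} = \frac{1}{F_{n,m;1-\beta}}.
\]


8.4.3 Z-Tests or Gauss Tests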


Suppose we have an item of unknown length. In order to get some information about
its length, we measure the item $n$ times with an instrument of known accuracy. As
sample we get a vector $x = (x_1, \ldots, x_n)$, where $x_j$ is the value obtained in the $j$th measurement.
These measurements were executed independently, thus, we may assume
that the $x_j$s are independent $\mathcal{N}(\mu, \sigma_0^2)$-distributed with known $\sigma_0^2 > 0$ and unknown
length $\mu \in \mathbb{R}$. Therefore, the describing statistical model is

\[
\bigl(\mathbb{R}^n, \mathcal{B}(\mathbb{R}^n), \mathcal{N}(\mu, \sigma_0^2)^{\otimes n}\bigr)_{\mu \in \mathbb{R}} = \bigl(\mathbb{R}^n, \mathcal{B}(\mathbb{R}^n), \mathcal{N}(\vec{\mu}, \sigma_0^2 I_n)\bigr)_{\mu \in \mathbb{R}}.
\]

Depending on the hypothesis, two types of tests apply in this case. We start with the so-called
one-sided Z-test (also called one-sided Gauss test). Here the null hypothesis is $H_0 : \mu \le \mu_0$,
where $\mu_0 \in \mathbb{R}$ is a given real number. Consequently, the alternative hypothesis
is $H_1 : \mu > \mu_0$, that is, $\Theta_0 = (-\infty, \mu_0]$ while $\Theta_1 = (\mu_0, \infty)$. In the above context this
says that we claim that the length of the item is less than or equal to a given $\mu_0$, and to
check this we measure the item $n$ times.


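Proposition 8.4.11. Let $\alpha \in (0,1)$ be a given security level. Then $T = (\mathcal{X}_0, \mathcal{X}_1)$ with

\[
\mathcal{X}_0 := \bigl\{ x \in \mathbb{R}^n : \bar{x} \le \mu_0 + n^{-1/2} \sigma_0\, z_{1-\alpha} \bigr\}
\]

and with

\[
\mathcal{X}_1 := \bigl\{ x \in \mathbb{R}^n : \bar{x} > \mu_0 + n^{-1/2} \sigma_0\, z_{1-\alpha} \bigr\}
\]

is an $\alpha$-significance test to check $H_0$ against $H_1$. Here $z_{1-\alpha}$ denotes the $(1-\alpha)$-quantile
introduced in Definition 8.4.5.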


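Proof: The assertion of Proposition 8.4.11 says that

\[
\sup_{\mu \le \mu_0} \mathbb{P}_\mu(\mathcal{X}_1) = \sup_{\mu \le \mu_0} \mathcal{N}(\mu, \sigma_0^2)^{\otimes n}(\mathcal{X}_1) \le \alpha.
\]

To verify this, let us choose an arbitrary $\mu \le \mu_0$ and define $S : \mathbb{R}^n \to \mathbb{R}$ by

\[
S(x) := \sqrt{n}\, \frac{\bar{x} - \mu}{\sigma_0}, \qquad x \in \mathbb{R}^n. \tag{8.15}
\]

Regard $S$ as a random variable on the probability space $(\mathbb{R}^n, \mathcal{B}(\mathbb{R}^n), \mathcal{N}(\mu, \sigma_0^2)^{\otimes n})$. Then,
by property (8.9), it is standard normally distributed.⁴ Consequently,

\[
\mathcal{N}(\mu, \sigma_0^2)^{\otimes n}\{x \in \mathbb{R}^n : S(x) > z_{1-\alpha}\} = \alpha. \tag{8.16}
\]

Since $\mu \le \mu_0$, we have

\[
\begin{aligned}
\mathcal{X}_1 &= \bigl\{ x \in \mathbb{R}^n : \bar{x} > \mu_0 + n^{-1/2} \sigma_0\, z_{1-\alpha} \bigr\} \\
&\subseteq \bigl\{ x \in \mathbb{R}^n : \bar{x} > \mu + n^{-1/2} \sigma_0\, z_{1-\alpha} \bigr\} = \{x \in \mathbb{R}^n : S(x) > z_{1-\alpha}\},
\end{aligned}
\]

hence, by eq. (8.16), it follows that

\[
\mathcal{N}(\mu, \sigma_0^2)^{\otimes n}(\mathcal{X}_1) \le \mathcal{N}(\mu, \sigma_0^2)^{\otimes n}\{x \in \mathbb{R}^n : S(x) > z_{1-\alpha}\} = \alpha.
\]

This completes the proof. □

⁴ This fact is crucial. For better understanding, here is a more detailed reasoning. Define random
variables $X_j$ on the probability space $(\mathbb{R}^n, \mathcal{B}(\mathbb{R}^n), \mathcal{N}(\mu, \sigma_0^2)^{\otimes n})$ by $X_j(x) = x_j$, where $x = (x_1, \ldots, x_n)$.
Then the random vector $X = (X_1, \ldots, X_n)$ is the identity on $\mathbb{R}^n$, hence $\mathcal{N}(\mu, \sigma_0^2)^{\otimes n}$-distributed. In view
of Remark 8.4.4 and by

\[
S(x) = \sqrt{n}\, \frac{\bar{X}(x) - \mu}{\sigma_0},
\]

assertion (8.9) applies for $S$, that is, it is $\mathcal{N}(0,1)$-distributed.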


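What does the power function of the Z-test in Proposition 8.4.11 look like? If $S$ is as in
eq. (8.15), then, according to Definition 8.2.11,

\[
\begin{aligned}
\beta_T(\mu) &= \mathcal{N}(\mu, \sigma_0^2)^{\otimes n}(\mathcal{X}_1) = \mathcal{N}(\mu, \sigma_0^2)^{\otimes n}\Bigl\{ x \in \mathbb{R}^n : \sqrt{n}\, \frac{\bar{x} - \mu_0}{\sigma_0} > z_{1-\alpha} \Bigr\} \\
&= \mathcal{N}(\mu, \sigma_0^2)^{\otimes n}\Bigl\{ x \in \mathbb{R}^n : S(x) > z_{1-\alpha} + \frac{\sqrt{n}}{\sigma_0}\, (\mu_0 - \mu) \Bigr\} \\
&= 1 - \Phi\Bigl( z_{1-\alpha} + \frac{\sqrt{n}}{\sigma_0}\, (\mu_0 - \mu) \Bigr) = \Phi\Bigl( z_\alpha + \frac{\sqrt{n}}{\sigma_0}\, (\mu - \mu_0) \Bigr).
\end{aligned}
\]

In particular, $\beta_T$ is increasing on $\mathbb{R}$ with $\beta_T(\mu_0) = \alpha$. Moreover, we see that $\beta_T(\mu) < \alpha$ if
$\mu < \mu_0$, and $\beta_T(\mu) > \alpha$ for $\mu > \mu_0$.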
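
The one-sided Z-test and its power function are easy to evaluate numerically. The sketch below uses `statistics.NormalDist` from the Python standard library; the function names are chosen for illustration, and the parameter values are those quoted in the caption of Figure 8.1.

```python
from math import sqrt
from statistics import NormalDist, mean

PHI = NormalDist()    # standard normal distribution

def one_sided_z_test_rejects(x, mu0, sigma0, alpha):
    """True if the sample x lies in the critical region of Proposition 8.4.11."""
    threshold = mu0 + sigma0 * PHI.inv_cdf(1 - alpha) / sqrt(len(x))
    return mean(x) > threshold

def power(mu, mu0, sigma0, n, alpha):
    """Power function beta_T(mu) = Phi(z_alpha + sqrt(n) * (mu - mu0) / sigma0)."""
    return PHI.cdf(PHI.inv_cdf(alpha) + sqrt(n) * (mu - mu0) / sigma0)

mu0, sigma0, n, alpha = 2.0, 1.0, 10, 0.05      # parameters of Figure 8.1
for mu in (1.5, 2.0, 2.5, 3.0):
    print(mu, round(power(mu, mu0, sigma0, n, alpha), 4))
```

At $\mu = \mu_0$ the printed value equals $\alpha$, and the values increase with $\mu$, in accordance with the remarks above.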


Figure 8.1: Power function of $T$ with $\alpha = 0.05$, $\mu_0 = 2$, $\sigma_0 = 1$, and $n = 10$.

While the critical region of a one-sided Z-test is an interval, in the case of the two-sided
Z-test it is the union of two intervals. Here the null hypothesis is $H_0 : \mu = \mu_0$,
hence the alternative hypothesis is given as $H_1 : \mu \ne \mu_0$.

Proposition 8.4.12. The test $T = (\mathcal{X}_0, \mathcal{X}_1)$, where

\[
\mathcal{X}_0 := \bigl\{ x \in \mathbb{R}^n : \mu_0 - n^{-1/2} \sigma_0\, z_{1-\alpha/2} \le \bar{x} \le \mu_0 + n^{-1/2} \sigma_0\, z_{1-\alpha/2} \bigr\}
\]

and $\mathcal{X}_1 = \mathbb{R}^n \setminus \mathcal{X}_0$, is an $\alpha$-significance test to check $H_0 : \mu = \mu_0$ against $H_1 : \mu \ne \mu_0$.

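Proof: Since here $\Theta_0 = \{\mu_0\}$, the proof becomes easier than in the one-sided case. We
only have to verify that

\[
\mathcal{N}(\mu_0, \sigma_0^2)^{\otimes n}(\mathcal{X}_1) \le \alpha. \tag{8.17}
\]

Regarding $S$, defined by

\[
S(x) := \sqrt{n}\, \frac{\bar{x} - \mu_0}{\sigma_0},
\]

as random variable on $(\mathbb{R}^n, \mathcal{B}(\mathbb{R}^n), \mathcal{N}(\mu_0, \sigma_0^2)^{\otimes n})$, by the same arguments as in the previous
proof, it is standard normally distributed. Thus, using assertion 4 of Proposition
8.4.6, we obtain

\[
\mathcal{N}(\mu_0, \sigma_0^2)^{\otimes n}(\mathcal{X}_1) = \mathcal{N}(\mu_0, \sigma_0^2)^{\otimes n}\{x \in \mathbb{R}^n : |S(x)| > z_{1-\alpha/2}\} = \alpha.
\]

Of course, this completes the proof. □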
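
That the two-sided Z-test rejects a true null hypothesis with probability exactly $\alpha$ can be illustrated by simulation. The following sketch draws many samples under $H_0$ and counts the rejections; all parameter values, the seed and the function name are assumptions made for illustration, and the observed rate only approximates $\alpha$ up to simulation error.

```python
import random
from math import sqrt
from statistics import NormalDist

def two_sided_z_test_rejects(x, mu0, sigma0, alpha):
    """True if x lies in the critical region of the two-sided Z-test (Proposition 8.4.12)."""
    n = len(x)
    s = sqrt(n) * (sum(x) / n - mu0) / sigma0
    return abs(s) > NormalDist().inv_cdf(1 - alpha / 2)

random.seed(1)                                             # fixed seed for reproducibility
mu0, sigma0, n, alpha, runs = 10.0, 0.5, 8, 0.05, 20000    # illustrative values

rejections = sum(
    two_sided_z_test_rejects([random.gauss(mu0, sigma0) for _ in range(n)], mu0, sigma0, alpha)
    for _ in range(runs)
)
rejection_rate = rejections / runs
print(rejection_rate)     # should be close to alpha = 0.05
```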


8.4.4 t-Tests 


The problem is similar to the one considered in the case of the Z-test. But there is one
important difference. We no longer assume that the variance is known, which is the
situation in most applications. Therefore, this test is more realistic than the Z-test.

The starting point is the statistical model

\[
\bigl(\mathbb{R}^n, \mathcal{B}(\mathbb{R}^n), \mathcal{N}(\mu, \sigma^2)^{\otimes n}\bigr)_{(\mu, \sigma^2) \in \mathbb{R} \times (0, \infty)}.
\]

Observe that the unknown parameter is now a vector $(\mu, \sigma^2) \in \mathbb{R} \times (0, \infty)$. We begin by
investigating the one-sided t-test. Given some $\mu_0 \in \mathbb{R}$, the null hypothesis is as before,
that is, we have $H_0 : \mu \le \mu_0$. In the general setting that means $\Theta_0 = (-\infty, \mu_0] \times (0, \infty)$,
while $\Theta_1 = (\mu_0, \infty) \times (0, \infty)$.

To formulate the next result, let us briefly recall the following notations. If $s_x^2$
denotes the unbiased sample variance, as defined in eq. (8.7), then we set $s_x := +\sqrt{s_x^2}$.
Furthermore, $t_{n-1;1-\alpha}$ denotes the $(1-\alpha)$-quantile of the $t_{n-1}$-distribution, as introduced
in Definition 8.4.8.


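Proposition 8.4.13. Given $\alpha \in (0,1)$, the regions $\mathcal{X}_0$ and $\mathcal{X}_1$ in $\mathbb{R}^n$ are defined by

\[
\mathcal{X}_0 := \bigl\{ x \in \mathbb{R}^n : \bar{x} \le \mu_0 + n^{-1/2} s_x\, t_{n-1;1-\alpha} \bigr\}
\]

and $\mathcal{X}_1 = \mathbb{R}^n \setminus \mathcal{X}_0$. With this choice of $\mathcal{X}_0$ and $\mathcal{X}_1$, the test $T = (\mathcal{X}_0, \mathcal{X}_1)$ is an $\alpha$-significance
test for $H_0 : \mu \le \mu_0$ against $H_1 : \mu > \mu_0$.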


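Proof: Given $\mu \le \mu_0$ and $\sigma^2 > 0$, define the random variable $S$ on $(\mathbb{R}^n, \mathcal{B}(\mathbb{R}^n), \mathcal{N}(\mu, \sigma^2)^{\otimes n})$ as

\[
S(x) := \sqrt{n}\, \frac{\bar{x} - \mu}{s_x}, \qquad x \in \mathbb{R}^n.
\]

Property (8.11) implies that $S$ is $t_{n-1}$-distributed, hence by the definition of the quantile
$t_{n-1;1-\alpha}$, it follows that

\[
\mathcal{N}(\mu, \sigma^2)^{\otimes n}\{x \in \mathbb{R}^n : S(x) > t_{n-1;1-\alpha}\} = \alpha.
\]

From $\mu \le \mu_0$ we easily derive

\[
\mathcal{X}_1 \subseteq \{x \in \mathbb{R}^n : S(x) > t_{n-1;1-\alpha}\},
\]

thus

\[
\mathcal{N}(\mu, \sigma^2)^{\otimes n}(\mathcal{X}_1) \le \mathcal{N}(\mu, \sigma^2)^{\otimes n}\{x \in \mathbb{R}^n : S(x) > t_{n-1;1-\alpha}\} = \alpha.
\]

Since $\mu \le \mu_0$ and $\sigma^2 > 0$ were arbitrary, the supremum of the left-hand side over all these parameters is at most $\alpha$, as asserted. □

As in the case of the Z-test, the null hypothesis of the two-sided t-test is $H_0 : \mu = \mu_0$
for some $\mu_0 \in \mathbb{R}$. Again, we do not assume that the variance is known.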


A two-sided t-test with significance level a may be constructed as follows. 


Proposition 8.4.14. Given $\alpha \in (0,1)$, define regions $\mathcal{X}_0$ and $\mathcal{X}_1$ in $\mathbb{R}^n$ by

\[
\mathcal{X}_0 := \Bigl\{ x \in \mathbb{R}^n : \sqrt{n}\, \Bigl| \frac{\bar{x} - \mu_0}{s_x} \Bigr| \le t_{n-1;1-\alpha/2} \Bigr\}
\]

and $\mathcal{X}_1 = \mathbb{R}^n \setminus \mathcal{X}_0$. Then $T = (\mathcal{X}_0, \mathcal{X}_1)$ is an $\alpha$-significance test for $H_0 : \mu = \mu_0$ against
$H_1 : \mu \ne \mu_0$.

Proposition 8.4.14 is proven by methods similar to those used in the proofs of
Propositions 8.4.12 and 8.4.13. Therefore, we omit the proof here.


Example 8.4.15. We claim a certain workpiece has a length of 22 inches. Thus, the null
hypothesis is $H_0 : \mu = 22$. To check $H_0$, we measure the piece 10 times under the same
conditions. The 10 values we obtained are (in inches)

22.17, 22.11, 22.10, 22.14, 22.02, 21.95, 22.02, 22.08, 21.98, 22.15

Do these values allow us to reject the hypothesis or do they confirm it?
We have

\[
\bar{x} = 22.072 \quad\text{and}\quad s_x = 0.07554248, \quad\text{hence}\quad \sqrt{10}\, \Bigl| \frac{\bar{x} - 22}{s_x} \Bigr| = 3.013986.
\]

If we choose the security level $\alpha = 0.05$, we have to investigate the quantile $t_{9;0.975}$,
which equals $t_{9;0.975} = 2.26$. Since $3.013986 > 2.26$, the observed vector $x = (x_1, \ldots, x_{10})$
belongs to $\mathcal{X}_1$, and we may reject $H_0$. Consequently, with a security of 95% we may say
$\mu \ne 22$.
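
The computation in Example 8.4.15 can be reproduced with a few lines of Python. The sketch below uses `statistics.mean` and `statistics.stdev` (the square root of the unbiased sample variance) from the standard library. The quantile $2.26$ is not computed but simply taken from the example, since the standard library offers no t-distribution.

```python
from math import sqrt
from statistics import mean, stdev

measurements = [22.17, 22.11, 22.10, 22.14, 22.02, 21.95, 22.02, 22.08, 21.98, 22.15]
mu0 = 22.0
t_quantile = 2.26         # t_{9;0.975}, the value quoted in Example 8.4.15

n = len(measurements)
x_bar = mean(measurements)
s_x = stdev(measurements)                       # uses the divisor n - 1
statistic = sqrt(n) * abs(x_bar - mu0) / s_x

print(round(x_bar, 3), round(s_x, 8), round(statistic, 6))
print("reject H0" if statistic > t_quantile else "cannot reject H0")
```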


Remark 8.4.16. If we plug these 10 values into a mathematical program, the result will
be a number $\alpha_0 \approx 0.0146$, the so-called p-value. What does this number tell us? It says the following. If
we have chosen a significance level $\alpha$ with $\alpha > \alpha_0$, then we have to reject $H_0$. But, if
the chosen $\alpha$ satisfies $\alpha < \alpha_0$, then we fail to reject $H_0$. In our case we had $\alpha = 0.05 > 0.0146 \approx \alpha_0$,
hence we may reject $H_0$.


8.4.5 χ²-Tests for the Variance


The aim of this section is to get some information about the (unknown) variance of
a normal distribution. Again we have to distinguish between two cases:
the expected value is either known or unknown.

Let us start with the former case, that is, we assume that the expected value is
known to be some $\mu_0 \in \mathbb{R}$. Then the statistical model is $(\mathbb{R}^n, \mathcal{B}(\mathbb{R}^n), \mathcal{N}(\mu_0, \sigma^2)^{\otimes n})_{\sigma^2 > 0}$.
In the one-sided $\chi^2$-test, the null hypothesis is $H_0 : \sigma^2 \le \sigma_0^2$, for some given $\sigma_0^2 > 0$,
while in the two-sided $\chi^2$-test we claim that $H_0 : \sigma^2 = \sigma_0^2$.


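Proposition 8.4.17. In the one-sided setting, an $\alpha$-significance $\chi^2$-test $T = (\mathcal{X}_0, \mathcal{X}_1)$ is
given by

\[
\mathcal{X}_0 := \Bigl\{ x \in \mathbb{R}^n : \sum_{j=1}^{n} \frac{(x_j - \mu_0)^2}{\sigma_0^2} \le \chi^2_{n;1-\alpha} \Bigr\}.
\]

For the two-sided case, choose

\[
\mathcal{X}_0 := \Bigl\{ x \in \mathbb{R}^n : \chi^2_{n;\alpha/2} \le \sum_{j=1}^{n} \frac{(x_j - \mu_0)^2}{\sigma_0^2} \le \chi^2_{n;1-\alpha/2} \Bigr\}
\]

to obtain an $\alpha$-significance test. In both cases the critical region is $\mathcal{X}_1 := \mathbb{R}^n \setminus \mathcal{X}_0$.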


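Proof: We prove the assertion only in the (slightly more difficult) one-sided case. For
an arbitrarily chosen $\sigma^2 \le \sigma_0^2$, let $\mathcal{N}(\mu_0, \sigma^2)^{\otimes n}$ be the underlying probability measure.
We define now the random variables $X_j : \mathbb{R}^n \to \mathbb{R}$ as $X_j(x) = x_j$ for $x = (x_1, \ldots, x_n)$. Then
the $X_j$s are independent $\mathcal{N}(\mu_0, \sigma^2)$-distributed. The normalization $Y_j := \frac{X_j - \mu_0}{\sigma}$ leads to
independent standard normal $Y_j$s. Thus, if

\[
S := \sum_{j=1}^{n} \frac{(X_j - \mu_0)^2}{\sigma^2} = \sum_{j=1}^{n} Y_j^2,
\]

then by Proposition 4.6.17, the random variable $S$ is $\chi^2_n$-distributed. By the definition of
quantiles, we arrive at

\[
\mathcal{N}(\mu_0, \sigma^2)^{\otimes n}\{x \in \mathbb{R}^n : S(x) > \chi^2_{n;1-\alpha}\} = \alpha.
\]

Since $\sigma^2 \le \sigma_0^2$, it follows that

\[
\mathcal{X}_1 \subseteq \{x \in \mathbb{R}^n : S(x) > \chi^2_{n;1-\alpha}\},
\]

hence $\mathcal{N}(\mu_0, \sigma^2)^{\otimes n}(\mathcal{X}_1) \le \alpha$. This proves, as asserted, that $T = (\mathcal{X}_0, \mathcal{X}_1)$ is an
$\alpha$-significance test. □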


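Let us now turn to the case where the expected value is unknown. Here the statistical
model is given by

\[
\bigl(\mathbb{R}^n, \mathcal{B}(\mathbb{R}^n), \mathcal{N}(\mu, \sigma^2)^{\otimes n}\bigr)_{(\mu, \sigma^2) \in \mathbb{R} \times (0, \infty)}.
\]

In the one-sided case, the parameter set $\Theta = \mathbb{R} \times (0, \infty)$ splits up into $\Theta = \Theta_0 \cup \Theta_1$
with

\[
\Theta_0 = \mathbb{R} \times (0, \sigma_0^2] \quad\text{and}\quad \Theta_1 = \mathbb{R} \times (\sigma_0^2, \infty).
\]

In the two-sided case we have

\[
\Theta_0 = \mathbb{R} \times \{\sigma_0^2\} \quad\text{and}\quad \Theta_1 = \mathbb{R} \times \bigl[(0, \sigma_0^2) \cup (\sigma_0^2, \infty)\bigr].
\]

Proposition 8.4.18. In the one-sided case, an $\alpha$-significance test $T = (\mathcal{X}_0, \mathcal{X}_1)$ is given by

\[
\mathcal{X}_0 := \Bigl\{ x \in \mathbb{R}^n : (n-1)\, \frac{s_x^2}{\sigma_0^2} \le \chi^2_{n-1;1-\alpha} \Bigr\}.
\]

In the two-sided case, choose the region of acceptance as

\[
\mathcal{X}_0 := \Bigl\{ x \in \mathbb{R}^n : \chi^2_{n-1;\alpha/2} \le (n-1)\, \frac{s_x^2}{\sigma_0^2} \le \chi^2_{n-1;1-\alpha/2} \Bigr\}
\]

to get an $\alpha$-significance test. Again, the critical regions are given by $\mathcal{X}_1 := \mathbb{R}^n \setminus \mathcal{X}_0$.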


Proof: The proof is very similar to that of Proposition 8.4.17, but with one
important difference. Here we have to set

\[
S(x) := (n-1)\, \frac{s_x^2}{\sigma^2}, \qquad x \in \mathbb{R}^n.
\]

Then property (8.10) applies, and it lets us conclude that $S$ is $\chi^2_{n-1}$-distributed, provided
that $\mathcal{N}(\mu, \sigma^2)^{\otimes n}$ is the true probability measure. After that observation the proof is
completed as the one of Proposition 8.4.17. □


8.4.6 Two-Sample Z-Tests 


The two-sample Z-test compares the parameters of two different populations. Suppose
we are given two different series of data, say $x = (x_1, \ldots, x_m)$ and $y = (y_1, \ldots, y_n)$, which
were obtained independently by executing $m$ experiments of the first kind and $n$ of the
second one. Combine both series to a single vector $(x, y) \in \mathbb{R}^{m+n}$.

A typical example for the described situation is as follows. A farmer grows grain
on two different fields. On one field he added fertilizer, on the other one he did not.
Now he wants to figure out whether or not adding fertilizer influenced the amount
of grain gathered. Therefore, he measures the amount of grain on the first field at $m$
different spots and that on the second one at $n$ spots. The aim is to compare the mean
values in both series of experiments.

We suppose that the samples $x_1, \ldots, x_m$ of the first population are independent
and $\mathcal{N}(\mu_1, \sigma_1^2)$-distributed, while the $y_1, \ldots, y_n$ of the second population are independent
and $\mathcal{N}(\mu_2, \sigma_2^2)$-distributed. Typical questions are as follows. Do we have $\mu_1 = \mu_2$ or,
maybe, only $\mu_1 \le \mu_2$? One may also ask whether or not $\sigma_1^2 = \sigma_2^2$ or, maybe, only $\sigma_1^2 \le \sigma_2^2$.

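To apply the two-sample Z-test, one has to suppose the variances $\sigma_1^2$ and $\sigma_2^2$ as
known. This reduces the number of parameters from 4 to 2, namely to $\mu_1$ and $\mu_2$ in $\mathbb{R}$.
Thus, the describing statistical model is given by

\[
\bigl(\mathbb{R}^{m+n}, \mathcal{B}(\mathbb{R}^{m+n}), \mathcal{N}(\mu_1, \sigma_1^2)^{\otimes m} \otimes \mathcal{N}(\mu_2, \sigma_2^2)^{\otimes n}\bigr)_{(\mu_1, \mu_2) \in \mathbb{R}^2}. \tag{8.18}
\]

Recall that $\mathcal{N}(\mu_1, \sigma_1^2)^{\otimes m} \otimes \mathcal{N}(\mu_2, \sigma_2^2)^{\otimes n}$ denotes the multivariate normal distribution
with expected value $(\underbrace{\mu_1, \ldots, \mu_1}_{m}, \underbrace{\mu_2, \ldots, \mu_2}_{n})$ and covariance matrix $R = (r_{ij})_{i,j=1}^{m+n}$, where
$r_{ii} = \sigma_1^2$ if $1 \le i \le m$, and $r_{ii} = \sigma_2^2$ if $m < i \le m + n$. Furthermore, $r_{ij} = 0$ if $i \ne j$.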


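Proposition 8.4.19. The statistical model is that in (8.18). To test $H_0 : \mu_1 \le \mu_2$ against
$H_1 : \mu_1 > \mu_2$, set

\[
\mathcal{X}_0 := \Bigl\{ (x, y) \in \mathbb{R}^{m+n} : \sqrt{\frac{mn}{n\sigma_1^2 + m\sigma_2^2}}\, (\bar{x} - \bar{y}) \le z_{1-\alpha} \Bigr\}
\]

and $\mathcal{X}_1 = \mathbb{R}^{m+n} \setminus \mathcal{X}_0$. Then the test $T = (\mathcal{X}_0, \mathcal{X}_1)$ is an $\alpha$-significance test for checking $H_0$
against $H_1$.
To test $H_0 : \mu_1 = \mu_2$ against $H_1 : \mu_1 \ne \mu_2$, let

\[
\mathcal{X}_0 := \Bigl\{ (x, y) \in \mathbb{R}^{m+n} : \sqrt{\frac{mn}{n\sigma_1^2 + m\sigma_2^2}}\, |\bar{x} - \bar{y}| \le z_{1-\alpha/2} \Bigr\}
\]

and $\mathcal{X}_1 = \mathbb{R}^{m+n} \setminus \mathcal{X}_0$. Then the test $T = (\mathcal{X}_0, \mathcal{X}_1)$ is an $\alpha$-significance test for checking $H_0$
against $H_1$.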


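Proof: Since the proof of the two-sided case is very similar to that of the one-sided
one, we only prove the first assertion. Thus, let us assume that $H_0$ is valid, that is, we
have $\mu_1 \le \mu_2$. Then we have to verify that

\[
\bigl(\mathcal{N}(\mu_1, \sigma_1^2)^{\otimes m} \otimes \mathcal{N}(\mu_2, \sigma_2^2)^{\otimes n}\bigr)(\mathcal{X}_1) \le \alpha. \tag{8.19}
\]

To prove this, we investigate the random variables $X_i$ and $Y_j$ defined as $X_i(x, y) = x_i$
and $Y_j(x, y) = y_j$. Since the underlying probability space is

\[
\bigl(\mathbb{R}^{m+n}, \mathcal{B}(\mathbb{R}^{m+n}), \mathcal{N}(\mu_1, \sigma_1^2)^{\otimes m} \otimes \mathcal{N}(\mu_2, \sigma_2^2)^{\otimes n}\bigr),
\]

these random variables are independent and distributed according to $\mathcal{N}(\mu_1, \sigma_1^2)$ and
$\mathcal{N}(\mu_2, \sigma_2^2)$, respectively. Consequently, $\bar{X}$ is $\mathcal{N}\bigl(\mu_1, \frac{\sigma_1^2}{m}\bigr)$-distributed, while $\bar{Y}$ is distributed
according to $\mathcal{N}\bigl(\mu_2, \frac{\sigma_2^2}{n}\bigr)$. By the construction, $\bar{X}$ and $\bar{Y}$ are independent as well,
and moreover, since $-\bar{Y}$ is $\mathcal{N}\bigl(-\mu_2, \frac{\sigma_2^2}{n}\bigr)$-distributed, we conclude that the distribution
of $\bar{X} - \bar{Y}$ equals $\mathcal{N}\bigl(\mu_1 - \mu_2, \frac{\sigma_1^2}{m} + \frac{\sigma_2^2}{n}\bigr)$. Therefore, the mapping $S : \mathbb{R}^{m+n} \to \mathbb{R}$ defined by

\[
S(x, y) := \Bigl( \frac{\sigma_1^2}{m} + \frac{\sigma_2^2}{n} \Bigr)^{-1/2} \Bigl[ \bigl(\bar{X}(x, y) - \bar{Y}(x, y)\bigr) - (\mu_1 - \mu_2) \Bigr]
\]

is standard normal. By the definition of the quantiles, this leads to

\[
\bigl(\mathcal{N}(\mu_1, \sigma_1^2)^{\otimes m} \otimes \mathcal{N}(\mu_2, \sigma_2^2)^{\otimes n}\bigr)\{(x, y) \in \mathbb{R}^{m+n} : S(x, y) > z_{1-\alpha}\} = \alpha. \tag{8.20}
\]

Since we assumed $H_0$ to be correct, that is, we suppose $\mu_1 \le \mu_2$, it follows that

\[
S(x, y) \ge \Bigl( \frac{\sigma_1^2}{m} + \frac{\sigma_2^2}{n} \Bigr)^{-1/2} \bigl[\bar{X}(x, y) - \bar{Y}(x, y)\bigr] = \sqrt{\frac{mn}{n\sigma_1^2 + m\sigma_2^2}}\, \bigl[\bar{X}(x, y) - \bar{Y}(x, y)\bigr].
\]

Hence

\[
\mathcal{X}_1 \subseteq \{(x, y) \in \mathbb{R}^{m+n} : S(x, y) > z_{1-\alpha}\},
\]

which by eq. (8.20) implies estimate (8.19). This completes the proof of this part of the
proposition. □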
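
The test statistic of Proposition 8.4.19 is straightforward to evaluate. The following sketch uses the Python standard library; the two data series and the variances are hypothetical values invented for illustration, and the function names are assumptions of this sketch.

```python
from math import sqrt
from statistics import NormalDist, mean

def two_sample_z_statistic(x, y, var1, var2):
    """sqrt(mn / (n*var1 + m*var2)) * (mean(x) - mean(y)), with known variances var1, var2."""
    m, n = len(x), len(y)
    return sqrt(m * n / (n * var1 + m * var2)) * (mean(x) - mean(y))

def rejects_one_sided(x, y, var1, var2, alpha):
    """Reject H0: mu1 <= mu2 if the statistic exceeds z_{1-alpha}."""
    return two_sample_z_statistic(x, y, var1, var2) > NormalDist().inv_cdf(1 - alpha)

def rejects_two_sided(x, y, var1, var2, alpha):
    """Reject H0: mu1 = mu2 if the absolute statistic exceeds z_{1-alpha/2}."""
    return abs(two_sample_z_statistic(x, y, var1, var2)) > NormalDist().inv_cdf(1 - alpha / 2)

# Hypothetical measurements, with variances assumed to be known.
x = [5.1, 4.9, 5.3, 5.5]
y = [4.6, 4.8, 4.4, 4.7, 4.5]
var1 = var2 = 0.04

print(two_sample_z_statistic(x, y, var1, var2))
print(rejects_one_sided(x, y, var1, var2, 0.05), rejects_two_sided(x, y, var1, var2, 0.05))
```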


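8.4.7 Two-Sample t-Tests


The situation is similar to that of the two-sample Z-test, yet with one important difference.
The variances $\sigma_1^2$ and $\sigma_2^2$ of the two populations are no longer known. Instead, we have
to assume that they coincide, that is, we suppose

\[
\sigma_1^2 = \sigma_2^2 =: \sigma^2.
\]

Therefore, there are three unknown parameters, the expected values $\mu_1$, $\mu_2$, and the
common variance $\sigma^2$. Thus, the statistical model describing this situation is given by

\[
\bigl(\mathbb{R}^{m+n}, \mathcal{B}(\mathbb{R}^{m+n}), \mathcal{N}(\mu_1, \sigma^2)^{\otimes m} \otimes \mathcal{N}(\mu_2, \sigma^2)^{\otimes n}\bigr)_{(\mu_1, \mu_2, \sigma^2) \in \mathbb{R}^2 \times (0, \infty)}. \tag{8.21}
\]

To simplify the formulation of the next statement, introduce $T : \mathbb{R}^{m+n} \to \mathbb{R}$ as

\[
T(x, y) := \sqrt{\frac{(m+n-2)\, mn}{m+n}}\; \frac{\bar{x} - \bar{y}}{\sqrt{(m-1)\, s_x^2 + (n-1)\, s_y^2}}, \qquad (x, y) \in \mathbb{R}^{m+n}. \tag{8.22}
\]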
Proposition 8.4.20. Let the statistical model be as in (8.21). If

\[
\mathcal{X}_0 := \{(x, y) \in \mathbb{R}^{m+n} : T(x, y) \le t_{m+n-2;1-\alpha}\}
\]

and $\mathcal{X}_1 = \mathbb{R}^{m+n} \setminus \mathcal{X}_0$, then $T = (\mathcal{X}_0, \mathcal{X}_1)$ is an $\alpha$-significance test for $H_0 : \mu_1 \le \mu_2$ against
$H_1 : \mu_1 > \mu_2$.
On the other hand, the test $T = (\mathcal{X}_0, \mathcal{X}_1)$ with

\[
\mathcal{X}_0 := \{(x, y) \in \mathbb{R}^{m+n} : |T(x, y)| \le t_{m+n-2;1-\alpha/2}\}
\]

and $\mathcal{X}_1 = \mathbb{R}^{m+n} \setminus \mathcal{X}_0$ is an $\alpha$-significance test for $H_0 : \mu_1 = \mu_2$ against $H_1 : \mu_1 \ne \mu_2$.

Proof: This time we prove the two-sided case, that is, the null hypothesis is given by
$H_0 : \mu_1 = \mu_2$.

Let the random vectors $X = (X_1, \ldots, X_m)$ and $Y = (Y_1, \ldots, Y_n)$ on $\mathbb{R}^{m+n}$ be defined
with $X_i$s and $Y_j$s as in the proof of Proposition 8.4.19, that is, we have $X(x, y) = x$
and $Y(x, y) = y$. Then by Proposition 4.1.9 and Remark 4.1.10, the unbiased sample
variances

\[
s_X^2 = \frac{1}{m-1} \sum_{i=1}^{m} (X_i - \bar{X})^2 \quad\text{and}\quad s_Y^2 = \frac{1}{n-1} \sum_{j=1}^{n} (Y_j - \bar{Y})^2
\]

are independent as well. Furthermore, by virtue of statement (8.10), the random
variables

\[
(m-1)\, \frac{s_X^2}{\sigma^2} \quad\text{and}\quad (n-1)\, \frac{s_Y^2}{\sigma^2}
\]

are distributed according to $\chi^2_{m-1}$ and $\chi^2_{n-1}$, respectively. Proposition 4.6.16 implies that

\[
S_{(X,Y)}^2 := \frac{1}{\sigma^2} \bigl\{ (m-1)\, s_X^2 + (n-1)\, s_Y^2 \bigr\}
\]

is $\chi^2_{m+n-2}$-distributed. Since $s_X^2$ and $\bar{X}$ as well as $s_Y^2$ and $\bar{Y}$ are independent, by Proposition
8.4.3, this is also so for $S_{(X,Y)}^2$ and $\bar{X} - \bar{Y}$. As in the proof of Proposition 8.4.19, it
follows that $\bar{X} - \bar{Y}$ is distributed according to $\mathcal{N}\bigl(\mu_1 - \mu_2, \frac{\sigma^2}{m} + \frac{\sigma^2}{n}\bigr)$. Assume now that
$H_0$ is true, that is, we have $\mu_1 = \mu_2$. Then the last observation implies that $\frac{\sqrt{mn}}{\sigma \sqrt{m+n}}\, (\bar{X} - \bar{Y})$
is a standard normally distributed random variable and, furthermore, independent of
$S_{(X,Y)}^2$. Thus, by Proposition 4.7.8, the quotient

\[
Z := \sqrt{m+n-2}\; \frac{\frac{\sqrt{mn}}{\sigma \sqrt{m+n}}\, (\bar{X} - \bar{Y})}{S_{(X,Y)}},
\]

where $S_{(X,Y)} := +\sqrt{S_{(X,Y)}^2}$, is $t_{m+n-2}$-distributed. If $T$ is as in eq. (8.22), then it is not difficult
to prove that $Z = T(X, Y)$. Therefore, by the definition of $X$ and $Y$, the mapping $T$ is
a $t_{m+n-2}$-distributed random variable on $\mathbb{R}^{m+n}$, endowed with the probability measure
$\mathbb{P}_{\mu_1, \mu_2, \sigma^2} = \mathcal{N}(\mu_1, \sigma^2)^{\otimes m} \otimes \mathcal{N}(\mu_2, \sigma^2)^{\otimes n}$. By eq. (8.14) this implies

\[
\mathbb{P}_{\mu_1, \mu_2, \sigma^2}(\mathcal{X}_1) = \mathbb{P}_{\mu_1, \mu_2, \sigma^2}\{(x, y) \in \mathbb{R}^{m+n} : |T(x, y)| > t_{m+n-2;1-\alpha/2}\} = \alpha
\]

as asserted. □
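
The statistic $T(x, y)$ of eq. (8.22) can be computed as follows. This is a minimal sketch with the Python standard library; the data are hypothetical and serve only to illustrate the formula. To carry out the test, the value would have to be compared with a quantile of the $t_{m+n-2}$-distribution taken from a table or a statistics package.

```python
from math import sqrt
from statistics import mean, variance

def two_sample_t_statistic(x, y):
    """Statistic T(x, y) of eq. (8.22); both populations are assumed to share one variance."""
    m, n = len(x), len(y)
    pooled = (m - 1) * variance(x) + (n - 1) * variance(y)   # variance() uses the divisor n - 1
    return sqrt((m + n - 2) * m * n / (m + n)) * (mean(x) - mean(y)) / sqrt(pooled)

x = [1.0, 2.0, 3.0]       # hypothetical data, only to illustrate the formula
y = [2.0, 4.0, 6.0]
print(two_sample_t_statistic(x, y))
print(two_sample_t_statistic(y, x))    # same absolute value, opposite sign
```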
8.4.8 F-Tests


In this final section about tests we compare the variances of two normally distributed
sample series. Since the proofs of the assertions follow the schemes presented in the
previous propositions, we omit them here. We only mention the facts that
play a crucial role in the proofs.

1. If $X_1, \ldots, X_m$ and $Y_1, \ldots, Y_n$ are independent and distributed according to
$\mathcal{N}(\mu_1, \sigma_1^2)$ and $\mathcal{N}(\mu_2, \sigma_2^2)$, respectively, then

\[
V := \frac{1}{\sigma_1^2} \sum_{i=1}^{m} (X_i - \mu_1)^2 \quad\text{and}\quad W := \frac{1}{\sigma_2^2} \sum_{j=1}^{n} (Y_j - \mu_2)^2
\]

are $\chi^2_m$- and $\chi^2_n$-distributed and independent. Consequently, the quotient $\frac{V/m}{W/n}$ is
$F_{m,n}$-distributed.

2. For $X_1, \ldots, X_m$ and $Y_1, \ldots, Y_n$ as in 1., the random variables

\[
(m-1)\, \frac{s_X^2}{\sigma_1^2} \quad\text{and}\quad (n-1)\, \frac{s_Y^2}{\sigma_2^2}
\]

are independent and distributed according to $\chi^2_{m-1}$ and $\chi^2_{n-1}$, respectively. Thus,
assuming $\sigma_1 = \sigma_2$, the quotient $s_X^2 / s_Y^2$ possesses an $F_{m-1,n-1}$-distribution.


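When applying an F-test, as before, two different cases have to be considered.
(K) The expected values $\mu_1$ and $\mu_2$ of the two populations are known. Then the
statistical model is given by

\[
\bigl(\mathbb{R}^{m+n}, \mathcal{B}(\mathbb{R}^{m+n}), \mathcal{N}(\mu_1, \sigma_1^2)^{\otimes m} \otimes \mathcal{N}(\mu_2, \sigma_2^2)^{\otimes n}\bigr)_{(\sigma_1^2, \sigma_2^2) \in (0, \infty)^2}.
\]

(U) The expected values are unknown. This case is described by the statistical model

\[
\bigl(\mathbb{R}^{m+n}, \mathcal{B}(\mathbb{R}^{m+n}), \mathcal{N}(\mu_1, \sigma_1^2)^{\otimes m} \otimes \mathcal{N}(\mu_2, \sigma_2^2)^{\otimes n}\bigr)_{(\mu_1, \mu_2, \sigma_1^2, \sigma_2^2) \in \mathbb{R}^2 \times (0, \infty)^2}.
\]

In both cases the null hypothesis may either be $H_0 : \sigma_1^2 \le \sigma_2^2$ in the one-sided case
or $H_0 : \sigma_1^2 = \sigma_2^2$ in the two-sided one. The regions of acceptance in each of the four
different cases are given by the following subsets of $\mathbb{R}^{m+n}$, and always $\mathcal{X}_1 = \mathbb{R}^{m+n} \setminus \mathcal{X}_0$.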
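Case 1: $H_0 : \sigma_1^2 \le \sigma_2^2$ and $\mu_1$, $\mu_2$ are known.

\[
\mathcal{X}_0 := \Bigl\{ (x, y) \in \mathbb{R}^{m+n} : \frac{\frac{1}{m} \sum_{i=1}^{m} (x_i - \mu_1)^2}{\frac{1}{n} \sum_{j=1}^{n} (y_j - \mu_2)^2} \le F_{m,n;1-\alpha} \Bigr\}
\]

Case 2: $H_0 : \sigma_1^2 = \sigma_2^2$ and $\mu_1$, $\mu_2$ are known.

\[
\mathcal{X}_0 := \Bigl\{ (x, y) \in \mathbb{R}^{m+n} : F_{m,n;\alpha/2} \le \frac{\frac{1}{m} \sum_{i=1}^{m} (x_i - \mu_1)^2}{\frac{1}{n} \sum_{j=1}^{n} (y_j - \mu_2)^2} \le F_{m,n;1-\alpha/2} \Bigr\}
\]

Case 3: $H_0 : \sigma_1^2 \le \sigma_2^2$ and $\mu_1$, $\mu_2$ are unknown.

\[
\mathcal{X}_0 := \Bigl\{ (x, y) \in \mathbb{R}^{m+n} : \frac{s_x^2}{s_y^2} \le F_{m-1,n-1;1-\alpha} \Bigr\}
\]

Case 4: $H_0 : \sigma_1^2 = \sigma_2^2$ and $\mu_1$, $\mu_2$ are unknown.

\[
\mathcal{X}_0 := \Bigl\{ (x, y) \in \mathbb{R}^{m+n} : F_{m-1,n-1;\alpha/2} \le \frac{s_x^2}{s_y^2} \le F_{m-1,n-1;1-\alpha/2} \Bigr\}
\]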


8.5 Point Estimators 


The starting point is a parametric statistical model $(\mathcal{X}, \mathcal{F}, \mathbb{P}_\theta)_{\theta \in \Theta}$. Assume we execute a
statistical experiment and observe a sample $x \in \mathcal{X}$. The aim of this section is to show
how this observation leads to a “good” estimation of the unknown parameter $\theta \in \Theta$.


Example 8.5.1. Suppose the statistical model is $(\mathbb{R}^n, \mathcal{B}(\mathbb{R}^n), \mathcal{N}(\mu, \sigma_0^2)^{\otimes n})_{\mu \in \mathbb{R}}$ for some
known $\sigma_0^2 > 0$. Thus, the unknown parameter is the expected value $\mu \in \mathbb{R}$. To estimate
it, we execute $n$ independent measurements and get $x = (x_1, \ldots, x_n) \in \mathbb{R}^n$. Knowing
this vector $x$, what is a “good” estimation for $\mu$? An intuitive approach is to define the
point estimator $\hat{\mu} : \mathbb{R}^n \to \mathbb{R}$ as

\[
\hat{\mu}(x) := \frac{1}{n} \sum_{j=1}^{n} x_j, \qquad x = (x_1, \ldots, x_n) \in \mathbb{R}^n.
\]

In other words, if the observed sample is $x$, then we take its sample mean $\hat{\mu}(x) = \bar{x}$ as
estimation for $\mu$. An immediate question is whether $\hat{\mu}$ is a “good” estimator for $\mu$. Or
do there maybe exist “better” (more precise) estimators for $\mu$?


Before we investigate such and similar questions, the problem has to be generalized
slightly. Sometimes it happens that we are not interested in the concrete value of the
parameter $\theta \in \Theta$. We only want to know the value $\gamma(\theta)$ derived from $\theta$. Thus, for some
function $\gamma : \Theta \to \mathbb{R}$ we want to find a “good” estimator $\hat{\gamma} : \mathcal{X} \to \mathbb{R}$ for $\gamma(\theta)$. In other
words, if we observe a sample $x \in \mathcal{X}$, then we take $\hat{\gamma}(x)$ as estimation for the (unknown)
value $\gamma(\theta)$. However, in most cases the function $\gamma$ is not needed. That is, here we have
$\gamma(\theta) = \theta$, and we look for a good estimator $\hat{\theta} : \mathcal{X} \to \Theta$ for $\theta$.
Let us state an example where a nontrivial function $\gamma$ plays a role.

Example 8.5.2. Let $(\mathbb{R}^n, \mathcal{B}(\mathbb{R}^n), \mathcal{N}(\mu, \sigma^2)^{\otimes n})_{(\mu, \sigma^2) \in \mathbb{R} \times (0, \infty)}$ be the statistical model. Thus,
the unknown parameter is the two-dimensional vector $(\mu, \sigma^2)$. But, in fact we are only
interested in $\mu$, not in the pair $(\mu, \sigma^2)$. That is, if

\[
\gamma(\mu, \sigma^2) := \mu, \qquad (\mu, \sigma^2) \in \mathbb{R} \times (0, \infty),
\]

then we want to find an estimation for $\gamma(\mu, \sigma^2)$.
Analogously, if we only want an estimation for $\sigma^2$, then we choose $\gamma$ as

\[
\gamma(\mu, \sigma^2) := \sigma^2, \qquad (\mu, \sigma^2) \in \mathbb{R} \times (0, \infty).
\]

After these preliminary considerations, we state now the precise definition of an 
estimator. 


Definition 8.5.3. Let $(\mathcal{X}, \mathcal{F}, \mathbb{P}_\theta)_{\theta \in \Theta}$ be a parametric statistical model and let
$\gamma : \Theta \to \mathbb{R}$ be a function of the parameter. A mapping $\hat{\gamma} : \mathcal{X} \to \mathbb{R}$ is said to be a point
estimator (or simply estimator) for $\gamma(\theta)$ if, given $t \in \mathbb{R}$, the set $\{x \in \mathcal{X} : \hat{\gamma}(x) \le t\}$
belongs to the $\sigma$-field $\mathcal{F}$. In other words, $\hat{\gamma}$ is a random variable defined on $\mathcal{X}$.


The interpretation of this definition is as follows. If one observes the sample $x \in \mathcal{X}$,
then $\hat{\gamma}(x)$ is an estimation for $\gamma(\theta)$. For example, if one measures a workpiece four times
and gets 22.03, 21.87, 22.11, and 22.15 inches as results, then using the estimator $\hat{\mu}$ in
Example 8.5.1, the estimation for the mean value equals 22.04 inches.


8.5.1 Maximum Likelihood Estimation 


Let $(\mathcal{X}, \mathcal{F}, P_\theta)_{\theta\in\Theta}$ be a parametric statistical model. There exist several methods to construct “good” point estimators for the unknown parameter $\theta$. In this section we present what is probably the most important of these methods, the so-called maximum likelihood principle.

To understand this principle, the following easy example may be helpful. 


Example 8.5.4. Suppose the parameter set consists of two elements, say $\Theta = \{0,1\}$. Moreover, the sample space $\mathcal{X}$ also has cardinality two, that is, $\mathcal{X} = \{a,b\}$. Then the problem is as follows. Depending on the observation $a$ or $b$, we have to choose either 0 or 1 as estimation for $\theta$.

For example, let us assume that $P_0(\{a\}) = 1/4$, hence $P_0(\{b\}) = 3/4$, and $P_1(\{a\}) = P_1(\{b\}) = 1/2$. Say an experiment has outcome “a.” What would be a good estimation for $\theta$ in this case? Should we take “0” or “1”? The answer is that we should choose “1.” Why? Because the sample “a” fits better to $P_1$ than to $P_0$: it occurs with probability $1/2$ under $P_1$ but only with probability $1/4$ under $P_0$. By the same argument, we should take “0” as an estimation if we observe “b.” Thus, the point estimator for $\theta$ is given by $\hat\theta(a) = 1$ and $\hat\theta(b) = 0$.


Which property characterizes the estimator $\hat\theta$ in Example 8.5.4? To answer this question, fix $x \in \mathcal{X}$ and look at the function

$$\theta \mapsto P_\theta(\{x\}), \qquad \theta \in \Theta. \tag{8.23}$$

If $x = a$, this function becomes maximal for $\theta = 1$, while for $x = b$ it attains its maximal value at $\theta = 0$. Consequently, the estimator $\hat\theta$ could also be defined as follows. For each fixed $x \in \mathcal{X}$, choose as estimation the $\theta \in \Theta$ for which the function (8.23) becomes maximal. But this is exactly the approach of the maximum likelihood principle.
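
The selection rule just described can be tried out on the two-point model of Example 8.5.4. The following is only a minimal sketch: the dictionary `table` and the helper `mle` are hypothetical names introduced for illustration, and the probabilities are the ones assumed in the example.

```python
from fractions import Fraction

# Probabilities P_theta({x}) from Example 8.5.4, stored as table[theta][x].
table = {
    0: {"a": Fraction(1, 4), "b": Fraction(3, 4)},
    1: {"a": Fraction(1, 2), "b": Fraction(1, 2)},
}

def mle(x):
    """Return the parameter theta that maximizes theta -> P_theta({x})."""
    return max(table, key=lambda theta: table[theta][x])

estimates = {x: mle(x) for x in ("a", "b")}
print(estimates)  # {'a': 1, 'b': 0}
```

The result agrees with the estimator found above: observing “a” leads to the estimate 1, observing “b” to the estimate 0.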

In order to describe this principle in the general setting, we have to introduce the notion of the likelihood function. Let us first assume that the sample space $\mathcal{X}$ consists of at most countably many elements.


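Definition 8.5.5. The function $p$ from $\Theta \times \mathcal{X}$ to $\mathbb{R}$ defined as

$$p(\theta, x) := P_\theta(\{x\}), \qquad \theta \in \Theta,\ x \in \mathcal{X},$$

is called likelihood function of the statistical model $(\mathcal{X}, \mathcal{P}(\mathcal{X}), P_\theta)_{\theta\in\Theta}$.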


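We come now to the case where all probability measures $P_\theta$ are continuous. Thus, we assume that the statistical model is $(\mathbb{R}^n, \mathcal{B}(\mathbb{R}^n), P_\theta)_{\theta\in\Theta}$ and, moreover, each $P_\theta$ is continuous, that is, it has a density mapping $\mathbb{R}^n$ to $\mathbb{R}$. This density is not only a function of $x \in \mathbb{R}^n$, it also depends on the probability measure $P_\theta$, hence on $\theta \in \Theta$. Therefore, we denote the densities by $p(\theta, x)$. In other words, for each $\theta \in \Theta$ and each box $Q \subseteq \mathbb{R}^n$ as in eq. (1.65) we have

$$P_\theta(Q) = \int_Q p(\theta, x)\,dx = \int_{a_1}^{b_1}\cdots\int_{a_n}^{b_n} p(\theta, x_1,\ldots,x_n)\,dx_n \cdots dx_1. \tag{8.24}$$


Definition 8.5.6. The function $p : \Theta\times\mathbb{R}^n \to \mathbb{R}$ satisfying eq. (8.24) for all boxes $Q$ and all $\theta\in\Theta$ is said to be the likelihood function of the statistical model $(\mathbb{R}^n, \mathcal{B}(\mathbb{R}^n), P_\theta)_{\theta\in\Theta}$.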


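For a better understanding of Definitions 8.5.5 and 8.5.6, let us give some examples of likelihood functions.

1. First take $(\mathcal{X}, \mathcal{P}(\mathcal{X}), B_{n,\theta})_{0\le\theta\le 1}$ with $\mathcal{X} = \{0,\ldots,n\}$ from Section 8.3. Then its likelihood function equals

$$p(\theta,k) = \binom{n}{k}\theta^k(1-\theta)^{n-k}, \qquad \theta\in[0,1],\ k\in\{0,\ldots,n\}. \tag{8.25}$$

2. Consider the statistical model $(\mathcal{X},\mathcal{P}(\mathcal{X}), H_{N,M,n})_{M=0,\ldots,N}$ investigated in Example 8.1.4. Then its likelihood function is given by

$$p(M,m) = \frac{\binom{M}{m}\binom{N-M}{n-m}}{\binom{N}{n}}, \qquad M=0,\ldots,N,\ m=0,\ldots,n. \tag{8.26}$$

3. The likelihood function of the model $(\mathbb{N}_0^n, \mathcal{P}(\mathbb{N}_0^n), \mathrm{Pois}_\lambda^{\otimes n})_{\lambda>0}$ investigated in Example 8.5.21 is

$$p(\lambda,k_1,\ldots,k_n) = \frac{\lambda^{k_1+\cdots+k_n}}{k_1!\cdots k_n!}\,e^{-\lambda n}, \qquad \lambda>0,\ k_j\in\mathbb{N}_0. \tag{8.27}$$

4. The likelihood function of $(\mathbb{R}^n, \mathcal{B}(\mathbb{R}^n), \mathcal{N}(\mu,\sigma^2)^{\otimes n})_{(\mu,\sigma^2)\in\mathbb{R}\times(0,\infty)}$ from Example 8.1.12 can be calculated by

$$p(\mu,\sigma^2,x) = \frac{1}{(2\pi)^{n/2}\sigma^n}\exp\left(-\frac{|x-\vec\mu|^2}{2\sigma^2}\right), \qquad \mu\in\mathbb{R},\ \sigma^2>0. \tag{8.28}$$

Here, as before, let $\vec\mu = (\mu,\ldots,\mu)$.

5. The likelihood function of $(\mathbb{R}^n,\mathcal{B}(\mathbb{R}^n), E_\lambda^{\otimes n})_{\lambda>0}$ from Example 8.1.9 may be represented as

$$p(\lambda,t_1,\ldots,t_n) = \begin{cases}\lambda^n e^{-\lambda(t_1+\cdots+t_n)} & : t_j > 0,\ \lambda > 0\\ 0 & : \text{otherwise}\end{cases} \tag{8.29}$$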

Definition 8.5.7. Let $(\mathcal{X},\mathcal{F},P_\theta)_{\theta\in\Theta}$ be a parametric statistical model with likelihood function $p : \Theta\times\mathcal{X}\to\mathbb{R}$. An estimator $\hat\theta : \mathcal{X}\to\Theta$ is said to be a maximum likelihood estimator (MLE) for $\theta\in\Theta$ provided that, for each $x\in\mathcal{X}$, the following is satisfied:

$$p(\hat\theta(x), x) = \max_{\theta\in\Theta} p(\theta,x).$$


Remark 8.5.8. Another way to define the MLE is as follows:⁵

$$\hat\theta(x) = \arg\max_{\theta\in\Theta} p(\theta,x), \qquad x\in\mathcal{X}.$$

How does one find the MLE for concrete statistical models? One observation is that the logarithm is an increasing function. Thus, the likelihood function $p(\cdot,x)$ becomes maximal at a certain parameter $\theta\in\Theta$ if and only if $\ln p(\cdot,x)$ does so.


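Definition 8.5.9. Let $(\mathcal{X},\mathcal{F},P_\theta)_{\theta\in\Theta}$ be a statistical model and let $p:\Theta\times\mathcal{X}\to\mathbb{R}$ be its likelihood function. Suppose $p(\theta,x)>0$ for all $(\theta,x)$. Then the function

$$L(\theta,x) := \ln p(\theta,x), \qquad \theta\in\Theta,\ x\in\mathcal{X},$$

is called log-likelihood function of the model.


Thus, $\hat\theta$ is an MLE if and only if

$$\hat\theta(x) = \arg\max_{\theta\in\Theta} L(\theta,x), \qquad x\in\mathcal{X},$$

or, equivalently, if

$$L(\hat\theta(x),x) = \max_{\theta\in\Theta} L(\theta,x).$$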


Example 8.5.10. If $p$ is the likelihood function in eq. (8.25), then the log-likelihood function equals

$$L(\theta,k) = c + k\ln\theta + (n-k)\ln(1-\theta), \qquad 0<\theta<1,\ k=0,\ldots,n. \tag{8.30}$$

Here $c\in\mathbb{R}$ denotes a certain constant independent of $\theta$.


Example 8.5.11. The log-likelihood function of $p$ in eq. (8.29) is well-defined for $\lambda>0$ and $t_j>0$. For those $\lambda$s and $t_j$s it is given by

$$L(\lambda,t_1,\ldots,t_n) = n\ln\lambda - \lambda(t_1+\cdots+t_n).$$


5 If $f$ is a real-valued function with domain $A$, then $x = \arg\max_{y\in A} f(y)$ if $x\in A$ and $f(x)\ge f(y)$ for all $y\in A$. In other words, $x$ is one of the points in the domain $A$ where $f$ attains its maximal value.


To proceed further we assume now that the parameter set $\Theta$ is a subset of $\mathbb{R}^k$ for some $k\ge1$. That is, each parameter $\theta$ consists of $k$ unknown components, that is, it may be written as $\theta = (\theta_1,\ldots,\theta_k)$ with $\theta_j\in\mathbb{R}$. Furthermore, suppose that for each fixed $x\in\mathcal{X}$ the log-likelihood function $L(\cdot,x)$ is continuously differentiable⁶ on $\Theta$. Then points $\theta^*\in\Theta$ where $L(\cdot,x)$ becomes maximal must satisfy (at least when $\theta^*$ is an interior point of $\Theta$)

$$\frac{\partial}{\partial\theta_i} L(\theta,x)\Big|_{\theta=\theta^*} = 0, \qquad i=1,\ldots,k. \tag{8.31}$$

In particular, this is true for the MLE $\hat\theta(x)$. If for each $x\in\mathcal{X}$ the log-likelihood function $L(\cdot,x)$ is continuously differentiable on $\Theta\subseteq\mathbb{R}^k$, then the MLE $\hat\theta$ satisfies

$$\frac{\partial}{\partial\theta_i} L(\theta,x)\Big|_{\theta=\hat\theta(x)} = 0, \qquad i=1,\ldots,k.$$


Example 8.5.12. Let us determine the MLE for the log-likelihood function in eq. (8.30). Here we have $\Theta=[0,1]\subseteq\mathbb{R}$, hence the MLE $\hat\theta:\{0,\ldots,n\}\to[0,1]$ has to satisfy

$$\frac{\partial}{\partial\theta}L(\hat\theta(k),k) = \frac{k}{\hat\theta(k)} - \frac{n-k}{1-\hat\theta(k)} = 0.$$

This easily gives $\hat\theta(k)=\frac{k}{n}$, that is, the MLE in this case is defined by

$$\hat\theta(k) = \frac{k}{n}, \qquad k=0,\ldots,n.$$

Let us interpret this result. An urn contains white and black balls in unknown proportion. Let $\theta$ be the proportion of white balls. To estimate $\theta$, draw $n$ balls out of the urn, with replacement. Assume $k$ of the chosen balls are white. Then $\hat\theta(k)=\frac{k}{n}$ is the MLE for the unknown proportion $\theta$ of white balls.
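
As a rough numerical cross-check of the formula $\hat\theta(k) = k/n$, one may maximize the log-likelihood (8.30) over a fine grid. This is only a sketch: the function names, the grid resolution, and the values $n = 20$, $k = 7$ are assumptions chosen for illustration.

```python
import math

def log_likelihood(theta, k, n):
    """Log-likelihood (8.30) without the constant c, for 0 < theta < 1."""
    return k * math.log(theta) + (n - k) * math.log(1 - theta)

def grid_mle(k, n, steps=1000):
    """Maximize the log-likelihood over the grid {1/steps, ..., (steps-1)/steps}."""
    grid = [i / steps for i in range(1, steps)]
    return max(grid, key=lambda theta: log_likelihood(theta, k, n))

n, k = 20, 7                      # hypothetical: 7 white balls among 20 draws
print(grid_mle(k, n), k / n)      # both values should agree
```

A grid search is a crude stand-in for solving the derivative equation, but it shows that the maximizer of the log-likelihood sits at the observed proportion of white balls.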


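Example 8.5.13. The logarithm of the likelihood function $p$ in eq. (8.28) equals

$$L(\mu,\sigma^2,x) = L(\mu,\sigma^2,x_1,\ldots,x_n) = c - \frac{n}{2}\ln\sigma^2 - \frac{1}{2\sigma^2}\sum_{j=1}^n (x_j-\mu)^2$$

with some constant $c\in\mathbb{R}$, independent of $\mu$ and of $\sigma^2$. Thus, here $\Theta\subseteq\mathbb{R}^2$, hence, if $\theta^*=(\mu^*,\sigma^{2*})$ denotes the pair satisfying eq. (8.31), then

$$\frac{\partial}{\partial\mu}L(\mu,\sigma^2,x)\Big|_{(\mu,\sigma^2)=(\mu^*,\sigma^{2*})}=0 \quad\text{and}\quad \frac{\partial}{\partial\sigma^2}L(\mu,\sigma^2,x)\Big|_{(\mu,\sigma^2)=(\mu^*,\sigma^{2*})}=0.$$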

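6 The partial derivatives exist and are continuous.


Now

$$\frac{\partial}{\partial\mu}L(\mu,\sigma^2,x) = \frac{1}{\sigma^2}\sum_{j=1}^n (x_j-\mu) = \frac{1}{\sigma^2}\Big[\sum_{j=1}^n x_j - n\mu\Big],$$

which implies $\mu^* = \frac1n\sum_{j=1}^n x_j = \bar x$.
The derivative of $L$ with respect to $\sigma^2$, taken at $\mu^*=\bar x$, equals

$$\frac{\partial}{\partial\sigma^2}L(\bar x,\sigma^2,x) = -\frac{n}{2}\cdot\frac{1}{\sigma^2} + \frac{1}{2\sigma^4}\sum_{j=1}^n (x_j-\bar x)^2.$$

It becomes zero at $\sigma^{2*}$ satisfying

$$\sigma^{2*} = \frac1n\sum_{j=1}^n (x_j-\bar x)^2 = \sigma_x^2,$$

where $\sigma_x^2$ was defined in eq. (8.7). Combining these observations, we see that the only pair $\theta^*=(\mu^*,\sigma^{2*})$ satisfying eq. (8.31) is given by $(\bar x,\sigma_x^2)$. Consequently, as MLE for $\theta=(\mu,\sigma^2)$ we obtain

$$\hat\mu(x)=\bar x \quad\text{and}\quad \widehat{\sigma^2}(x)=\sigma_x^2, \qquad x\in\mathbb{R}^n.$$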
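
The following sketch illustrates this result numerically. It evaluates the log-likelihood of Example 8.5.13 (with the constant $c$ dropped) at the pair $(\bar x, \sigma_x^2)$ and at a few perturbed pairs. The sample reuses the four workpiece measurements quoted earlier in this section; the perturbation sizes and all function names are assumptions made for illustration.

```python
import math

def log_likelihood(mu, var, xs):
    """Log-likelihood of N(mu, var) for the sample xs, constant c omitted."""
    n = len(xs)
    return -0.5 * n * math.log(var) - sum((x - mu) ** 2 for x in xs) / (2 * var)

xs = [22.03, 21.87, 22.11, 22.15]           # workpiece measurements from the text
mu_hat = sum(xs) / len(xs)                   # sample mean
var_hat = sum((x - mu_hat) ** 2 for x in xs) / len(xs)   # divides by n, not n - 1

best = log_likelihood(mu_hat, var_hat, xs)
# Perturbing either component should never increase the log-likelihood.
for d_mu in (-0.05, 0.0, 0.05):
    for factor in (0.5, 1.0, 2.0):
        assert log_likelihood(mu_hat + d_mu, var_hat * factor, xs) <= best
print(round(mu_hat, 2))
```

Note that the variance estimate divides by $n$, in line with $\sigma_x^2$, and not by $n-1$.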


Remark 8.5.14. Similar calculations as in the previous examples show that the MLEs for the likelihood functions in eqs. (8.27) and (8.29) coincide with

$$\hat\lambda(k_1,\ldots,k_n) = \frac1n\sum_{i=1}^n k_i \quad\text{and}\quad \hat\lambda(t_1,\ldots,t_n) = \frac{1}{\frac1n\sum_{i=1}^n t_i}.$$


Finally, we present two likelihood functions whose maximal values have to be determined directly. Note that the above approach via the log-likelihood function does not apply if the parameter set $\Theta$ is either finite or countably infinite. In this case a derivative of $L(\cdot,x)$ does not make sense, hence we cannot determine points where it vanishes.

The first problem is the one we discussed in Remark 1.4.27. A retailer gets a delivery of $N$ machines. Among the $N$ machines are $M$ defective ones. Since $M$ is unknown, the retailer wants a “good” estimate for it. Therefore, he chooses $n$ machines at random and tests them. Suppose he observes $m$ defective machines among those tested. Does this lead to an estimation of the number $M$ of defective machines? The next proposition answers this question.


Proposition 8.5.15. The MLE $\hat M$ for $M$ is of the form

$$\hat M(m) = \begin{cases}\left[\frac{m(N+1)}{n}\right] & : m<n\\[2pt] N & : m=n\end{cases}$$

Here $[\,\cdot\,]$ denotes the integer part of a real number, for example, $[1.2]=1$ or $[\pi]=3$.


Proof: The likelihood function $p$ was determined in eq. (8.26) as

$$p(M,m) = \frac{\binom{M}{m}\binom{N-M}{n-m}}{\binom{N}{n}}.$$

First note that $p(M,m)\neq0$ if and only if $M\in\{m,\ldots,N-n+m\}$ and, therefore, it suffices to investigate $p(M,m)$ for $M$s in this region. Thus, if $M-1\ge m$, then easy calculations lead to

$$\frac{p(M,m)}{p(M-1,m)} = \frac{M}{M-m}\cdot\frac{N-M+1-(n-m)}{N-M+1}. \tag{8.32}$$

By eq. (8.32) it follows that we have $p(M,m)\ge p(M-1,m)$ if and only if

$$M\,(N-M+1-(n-m)) \ge (M-m)(N-M+1).$$

Elementary transformations show the last estimate is equivalent to

$$-nM \ge -mN - m,$$

which happens if and only if $M\le\frac{m(N+1)}{n}$.
Consequently, $M\mapsto p(M,m)$ is nondecreasing on $\{0,\ldots,[\frac{m(N+1)}{n}]\}$, and it is nonincreasing on $\{[\frac{m(N+1)}{n}],\ldots,N\}$. Thus, if $m<n$, then the likelihood function $M\mapsto p(M,m)$ becomes maximal for $M^*=[\frac{m(N+1)}{n}]$, and the MLE is given by

$$\hat M(m) = \left[\frac{m(N+1)}{n}\right], \qquad m=0,\ldots,n-1.$$

If $m=n$, then $\frac{m(N+1)}{n}=N+1$, so $M\mapsto p(M,m)$ is nondecreasing on $\{0,\ldots,N\}$. Hence in this case the likelihood function attains its maximal value at $M=N$, that is, $\hat M(n)=N$. □


Example 8.5.16. A retailer gets a delivery of 100 TV sets for further selling. He chooses 15 sets at random and tests them. If there is exactly one defective TV set among the 15 tested, then the estimation for the number of defective sets in the delivery is 6. If he observes 2 defective sets, the estimation is 13, for 4 it is 26, and if there are even 6 defective TV sets among the 15 chosen, then the estimation is that 40 sets of the delivery are defective.
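
The numbers in Example 8.5.16 can be reproduced by comparing the closed formula $[m(N+1)/n]$ with a direct search over all admissible $M$. The sketch below uses hypothetical helper names and relies only on `math.comb` from the Python standard library.

```python
from math import comb

def likelihood_numerator(M, m, N, n):
    """Numerator of (8.26); the denominator binom(N, n) does not depend on M."""
    return comb(M, m) * comb(N - M, n - m)

def mle_brute_force(m, N, n):
    """Search all M in {0, ..., N} for a maximizer of M -> p(M, m)."""
    return max(range(N + 1), key=lambda M: likelihood_numerator(M, m, N, n))

def mle_formula(m, N, n):
    """Closed form: integer part of m(N+1)/n for m < n, and N for m = n."""
    return N if m == n else (m * (N + 1)) // n

N, n = 100, 15
for m in (1, 2, 4, 6):
    print(m, mle_brute_force(m, N, n), mle_formula(m, N, n))
```

If $m(N+1)/n$ happens to be an integer, then $M^*$ and $M^*-1$ yield the same likelihood, so a brute-force search may return either of them. This does not occur for the values used here.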


Finally we come back to the question asked in Remark 1.4.29. In order to estimate the number $N$ of fish in a pond one catches $M$ of them, marks them and puts them back into the pond. After some time one catches fish again, this time $n$ of them. Among them $m$ are marked. Does this number $m$ lead to a “good” estimation of the number of fish in the pond? To describe this problem we choose as statistical model

$$(\mathcal{X},\mathcal{P}(\mathcal{X}),H_{N,M,n})_{N=0,1,\ldots}$$

where $\mathcal{X}=\{0,\ldots,n\}$. Here $H_{N,M,n}$ denotes the hypergeometric probability measure introduced in Definition 1.4.25. Thus, in this case the likelihood function is given by

$$p(N,m) = \frac{\binom{M}{m}\binom{N-M}{n-m}}{\binom{N}{n}}, \qquad N=0,1,\ldots,\ m=0,\ldots,n.$$

In the sequel we have to exclude $m=0$; in this case there does not exist a reasonable estimation for $N$.


Proposition 8.5.17. If $1\le m\le n$, then the MLE $\hat N$ for $N$ is

$$\hat N(m) = \left[\frac{Mn}{m}\right]. \tag{8.33}$$


Proof: The proof is quite similar to that of Proposition 8.5.15. Since

$$\frac{p(N,m)}{p(N-1,m)} = \frac{N-M}{N}\cdot\frac{N-n}{N-M-(n-m)},$$

it easily follows that the inequality $p(N,m)\ge p(N-1,m)$ is valid if and only if $N\le\frac{Mn}{m}$. Therefore, $N\mapsto p(N,m)$ is nondecreasing if $N\le[\frac{Mn}{m}]$ and nonincreasing for the remaining $N$. This immediately shows that the MLE is given by eq. (8.33). □


Example 8.5.18. An unknown number of balls are in an urn. In order to estimate 
this number, we choose 50 balls from the urn and mark them. We put back the 
marked balls and mix the balls in the urn thoroughly. Then we choose another 30 
balls from the urn. If there are 7 marked among the 30, then the estimation for the 
number of balls in the urn is 214. In the case of two marked balls, the estimation 
equals 750 while in the case of 16 marked balls we estimate that there are 93 balls in 
the urn. 
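
The estimates in Example 8.5.18 follow directly from eq. (8.33). As a hedged sketch, the code below evaluates the formula and additionally checks, within a finite search window chosen only for illustration, that no other population size yields a larger likelihood. The function names are hypothetical.

```python
from fractions import Fraction
from math import comb

def likelihood(N, M, n, m):
    """Hypergeometric likelihood p(N, m) for N >= max(M, n)."""
    return Fraction(comb(M, m) * comb(N - M, n - m), comb(N, n))

def mle_pond(M, n, m):
    """Integer part of M*n/m, as in eq. (8.33); requires 1 <= m <= n."""
    return (M * n) // m

M, n = 50, 30
for m in (7, 2, 16):
    N_hat = mle_pond(M, n, m)
    best = likelihood(N_hat, M, n, m)
    # No population size in the search window should beat N_hat.
    assert all(likelihood(N, M, n, m) <= best
               for N in range(max(M, n), 3 * N_hat))
    print(m, N_hat)
```

Exact rational arithmetic is used so that ties, such as the one between $N = 749$ and $N = 750$ for $m = 2$, are not blurred by rounding.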

8.5.2 Unbiased Estimators


Let us come back to the general setting. We are given a function $\gamma:\Theta\to\mathbb{R}$ and look for a “good” estimation for $\gamma(\theta)$. If $\hat\gamma(x)$ is the estimation, in most cases it will not be the correct value $\gamma(\theta)$. Sometimes the estimate is larger than $\gamma(\theta)$, sometimes one observes an $x\in\mathcal{X}$ for which $\hat\gamma(x)$ is smaller than the true value. For example, if the retailer in Example 8.5.16 gets every week a delivery of 100 TV sets, then sometimes his estimation for the number of defective sets will be bigger than the true value, sometimes smaller. Since he only pays for the nondefective sets, sometimes he pays too much, sometimes too little. Therefore, a crucial condition for a good estimator should be that, on average, it meets the correct value. That is, in the long run the loss and the gain of the retailer should balance. In other words, the estimator should not be biased by a systematic error.

In view of Proposition 7.1.29, this condition for the estimator $\hat\gamma$ may be formulated as follows. If $\theta\in\Theta$ is the “true” parameter, then the expected value of $\hat\gamma$ should be $\gamma(\theta)$. To make this more precise⁷ we need the following notation.


Definition 8.5.19. Let $(\mathcal{X},\mathcal{F},P_\theta)_{\theta\in\Theta}$ be a statistical model and let $X:\mathcal{X}\to\mathbb{R}$ be a random variable. We write $E_\theta X$ whenever the expected value of $X$ is taken with respect to $P_\theta$. Similarly, in this case define

$$V_\theta X = E_\theta|X-E_\theta X|^2$$

as variance of $X$. Of course, we have to assume that the expected value and/or the variance exist.


Remark 8.5.20. If $X$ is discrete with values in $\{t_1,t_2,\ldots\}$, then

$$E_\theta X = \sum_{j=1}^\infty t_j\,P_\theta\{X=t_j\}.$$

The case of continuous $X$ is slightly more difficult because here we have to describe the density function of $X$ with respect to $P_\theta$.


To become acquainted with Definition 8.5.19, the two following examples may be 
helpful. The first one deals with the discrete case, the second with the continuous 
one. 


7 How is the expected value defined? Note that we do not have only one probability measure, but many different ones.


Example 8.5.21. Suppose the daily number of customers in a shopping center is Poisson distributed with unknown parameter $\lambda>0$. To estimate this parameter, we record the number of customers on $n$ different days. Thus, the sample we obtain is a vector $k=(k_1,\ldots,k_n)$ with $k_j\in\mathbb{N}_0$, where $k_j$ is the number of customers on day $j$. The describing statistical model is given by $(\mathbb{N}_0^n,\mathcal{P}(\mathbb{N}_0^n),\mathrm{Pois}_\lambda^{\otimes n})_{\lambda>0}$ with Poisson distribution $\mathrm{Pois}_\lambda$. Let $X:\mathbb{N}_0^n\to\mathbb{R}$ be defined by

$$X(k) = X(k_1,\ldots,k_n) = \frac1n\sum_{j=1}^n k_j, \qquad k=(k_1,\ldots,k_n)\in\mathbb{N}_0^n.$$

Which value does $E_\lambda X$ possess?

Answer: If we choose $\mathrm{Pois}_\lambda^{\otimes n}$ as probability measure, then all $X_j$s defined by $X_j(k_1,\ldots,k_n):=k_j$ are $\mathrm{Pois}_\lambda$-distributed (and independent, but this is not needed here). Note that $X_j$ is nothing else than the number of customers on day $j$. Hence, by Proposition 5.1.16, the expected value of $X_j$ is $\lambda$, and since $X=\frac1n\sum_{j=1}^n X_j$, we finally obtain

$$E_\lambda X = E_\lambda\Big(\frac1n\sum_{j=1}^n X_j\Big) = \frac1n\sum_{j=1}^n E_\lambda X_j = \frac1n\,n\lambda = \lambda.$$


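Example 8.5.22. Take

$$(\mathbb{R}^n,\mathcal{B}(\mathbb{R}^n),\mathcal{N}(\mu,\sigma^2)^{\otimes n})_{(\mu,\sigma^2)\in\mathbb{R}\times(0,\infty)}$$

as the statistical model. Thus, the parameter is of the form $(\mu,\sigma^2)$ for some $\mu\in\mathbb{R}$ and $\sigma^2>0$. Define $X:\mathbb{R}^n\to\mathbb{R}$ by $X(x)=\bar x$. If the underlying measure is $\mathcal{N}(\mu,\sigma^2)^{\otimes n}$, then⁸ $X$ is $\mathcal{N}(\mu,\sigma^2/n)$-distributed. Consequently, in view of Propositions 5.1.34 and 5.2.27 we obtain

$$E_{\mu,\sigma^2}X=\mu \quad\text{and}\quad V_{\mu,\sigma^2}X=\frac{\sigma^2}{n}.$$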


Using the notation introduced in Definition 8.5.19, the above-mentioned requirement 
for “good” estimators may now be formulated more precisely. 


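Definition 8.5.23. An estimator $\hat\gamma:\mathcal{X}\to\mathbb{R}$ is said to be an unbiased estimator for $\gamma:\Theta\to\mathbb{R}$ provided that for each $\theta\in\Theta$

$$E_\theta|\hat\gamma|<\infty \quad\text{and}\quad E_\theta\hat\gamma=\gamma(\theta).$$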


8 Compare the first part of the proof of Proposition 8.4.3. 




Remark 8.5.24. In view of Proposition 7.1.29, an estimator $\hat\gamma$ is unbiased if it possesses the following property: observe $N$ independent samples $x^1,\ldots,x^N$ of a statistical experiment. Suppose that $\theta\in\Theta$ is the “true” parameter (according to which the $x^j$s are distributed). Then

$$P_\theta\Big\{\lim_{N\to\infty}\frac1N\sum_{j=1}^N\hat\gamma(x^j)=\gamma(\theta)\Big\}=1.$$

Thus, on average, the estimator $\hat\gamma$ meets approximately the correct value.


Example 8.5.25. Let us investigate whether the estimator in Example 8.5.12 is unbiased. The statistical model is $(\mathcal{X},\mathcal{P}(\mathcal{X}),B_{n,\theta})_{0\le\theta\le1}$, where $\mathcal{X}=\{0,\ldots,n\}$ and the estimator $\hat\theta$ acts as

$$\hat\theta(k)=\frac kn, \qquad k=0,\ldots,n.$$

Setting $Z:=n\hat\theta$, then $Z$ is the identity on $\mathcal{X}$, hence $B_{n,\theta}$-distributed. Proposition 5.1.13 implies $E_\theta Z=n\theta$, thus,

$$E_\theta\hat\theta = E_\theta(Z/n) = E_\theta Z/n = \theta. \tag{8.34}$$

Equation (8.34) holds for all $\theta\in[0,1]$, that is, $\hat\theta$ is an unbiased estimator for $\theta$.


Example 8.5.26. Next we come back to the problem presented in Example 8.5.21. The number of customers per day is $\mathrm{Pois}_\lambda$-distributed with an unknown parameter $\lambda>0$. The data of $n$ days are combined in a vector $k=(k_1,\ldots,k_n)\in\mathbb{N}_0^n$. Then the parameter $\lambda>0$ is estimated by $\hat\lambda$ defined as

$$\hat\lambda(k)=\hat\lambda(k_1,\ldots,k_n):=\frac1n\sum_{j=1}^n k_j.$$

Is this estimator for $\lambda$ unbiased?
Answer: Yes, it is unbiased. Observe that $\hat\lambda$ coincides with the random variable $X$ investigated in Example 8.5.21. There we proved $E_\lambda X=\lambda$, hence, if $\lambda>0$, then we have

$$E_\lambda\hat\lambda=\lambda.$$


Example 8.5.27. We are given certain data $x_1,\ldots,x_n$, which are known to be normally distributed and independent, and where the expected value $\mu$ and the variance $\sigma^2$ of the underlying probability measure are unknown. Thus, the describing statistical model is

$$(\mathbb{R}^n,\mathcal{B}(\mathbb{R}^n),\mathcal{N}(\mu,\sigma^2)^{\otimes n})_{(\mu,\sigma^2)\in\Theta} \quad\text{with}\quad \Theta=\mathbb{R}\times(0,\infty).$$




The aim is to find unbiased estimators for $\mu$ and for $\sigma^2$. Let us begin with estimating $\mu$. That is, if $\gamma$ is defined by $\gamma(\mu,\sigma^2)=\mu$, then we want to construct an unbiased estimator $\hat\gamma$ for $\gamma$. Let us take the MLE defined as

$$\hat\gamma(x)=\bar x=\frac1n\sum_{j=1}^n x_j, \qquad x=(x_1,\ldots,x_n).$$

Due to the calculations in Example 8.5.22 we obtain

$$E_{\mu,\sigma^2}\hat\gamma=\mu=\gamma(\mu,\sigma^2).$$

This holds for all $\mu$ and $\sigma^2$, hence $\hat\gamma$ is an unbiased estimator for $\mu=\gamma(\mu,\sigma^2)$.
How do we find a suitable estimator for $\sigma^2$? This time the function $\gamma$ has to be chosen as $\gamma(\mu,\sigma^2):=\sigma^2$. With $s_x^2$ defined in eq. (8.7) set

$$\hat\gamma(x):=s_x^2=\frac{1}{n-1}\sum_{j=1}^n(x_j-\bar x)^2, \qquad x\in\mathbb{R}^n.$$

Is this an unbiased estimator for $\sigma^2$? To answer this we use property (8.10) of Proposition 8.4.3. It asserts that the random variable $x\mapsto(n-1)\frac{s_x^2}{\sigma^2}$ is $\chi^2_{n-1}$-distributed, provided it is defined on $(\mathbb{R}^n,\mathcal{B}(\mathbb{R}^n),\mathcal{N}(\mu,\sigma^2)^{\otimes n})$. Consequently, by Corollary 5.1.30 it follows that

$$E_{\mu,\sigma^2}\Big[(n-1)\frac{s_x^2}{\sigma^2}\Big]=n-1.$$

Using the linearity of the expected value, we finally obtain

$$E_{\mu,\sigma^2}\hat\gamma=E_{\mu,\sigma^2}s_x^2=\sigma^2.$$

Therefore, $\hat\gamma(x)=s_x^2$ is an unbiased⁹ estimator for $\sigma^2$.


9 This explains why $s_x^2$ is called unbiased sample variance.


Remark 8.5.28. Taking the estimator $\hat\gamma(x)=\sigma_x^2$ in the previous example, then, in view of $\sigma_x^2=\frac{n-1}{n}s_x^2$, it follows that

$$E_{\mu,\sigma^2}\hat\gamma=\frac{n-1}{n}\,\sigma^2.$$

Thus, the estimator $\hat\gamma(x)=\sigma_x^2$ is biased. But note that

$$\lim_{n\to\infty}\frac{n-1}{n}\,\sigma^2=\sigma^2,$$

hence, if the sample size $n$ is big, then this estimator is “almost” unbiased. One says in this case the sequence of estimators (in dependence on $n$) is asymptotically unbiased.
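
A small simulation can make the difference between $s_x^2$ and $\sigma_x^2$ visible. The sketch below is illustrative only: the parameter values, the number of repetitions, and the fixed seed are assumptions, and a simulation can of course only suggest, not prove, the expected values derived above.

```python
import random

random.seed(0)                             # fixed seed so the run is reproducible
mu, sigma, n, runs = 1.0, 2.0, 5, 20000    # hypothetical values

sum_s2 = sum_sigma2 = 0.0
for _ in range(runs):
    xs = [random.gauss(mu, sigma) for _ in range(n)]
    mean = sum(xs) / n
    ss = sum((x - mean) ** 2 for x in xs)
    sum_s2 += ss / (n - 1)                 # unbiased sample variance s_x^2
    sum_sigma2 += ss / n                   # biased version sigma_x^2

avg_s2 = sum_s2 / runs                     # should be close to sigma^2 = 4
avg_sigma2 = sum_sigma2 / runs             # should be close to (n-1)/n * sigma^2 = 3.2
print(avg_s2, avg_sigma2)
```

With $n = 5$ the factor $(n-1)/n = 0.8$ is clearly noticeable. For larger $n$ the two averages move closer together, which reflects the asymptotic unbiasedness.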


The next example is slightly more involved, but of great interest in applications.


Example 8.5.29. The lifetime of light bulbs is supposed to be exponentially distributed with some unknown parameter $\lambda>0$. To estimate $\lambda$ we switch on $n$ light bulbs and record the times $t_1,\ldots,t_n$ when they burn out. Thus, the observed sample is a vector $t=(t_1,\ldots,t_n)$ in $(0,\infty)^n$. As estimator for $\lambda$ we choose

$$\hat\lambda(t):=\frac{n}{\sum_{j=1}^n t_j}=1/\bar t.$$

Is this an unbiased estimator for $\lambda$?

Answer: The statistical model describing this experiment is $(\mathbb{R}^n,\mathcal{B}(\mathbb{R}^n),E_\lambda^{\otimes n})_{\lambda>0}$. If the random variables $X_j$ are defined by $X_j(t):=t_j$, then they are independent and $E_\lambda$-distributed. Because of Proposition 4.6.13, their sum $X:=\sum_{j=1}^n X_j$ possesses an Erlang distribution with parameters $n$ and $\lambda$. An application of eq. (5.22) in Proposition 5.1.36 for $f(x):=\frac nx$ implies

$$E_\lambda\hat\lambda=E_\lambda\Big(\frac nX\Big)=\int_0^\infty\frac nx\,\frac{\lambda^n}{(n-1)!}\,x^{n-1}e^{-\lambda x}\,dx.$$

A change of variables $s:=\lambda x$ transforms the last integral into

$$\frac{\lambda n}{(n-1)!}\int_0^\infty s^{n-2}e^{-s}\,ds=\frac{\lambda n}{(n-1)!}\,\Gamma(n-1)=\frac{\lambda n}{(n-1)!}\,(n-2)!=\lambda\cdot\frac{n}{n-1}.$$

This tells us that $\hat\lambda$ is not an unbiased estimator for $\lambda$. But, as mentioned in Remark 8.5.28 for $\sigma_x^2$, the sequence of estimators is asymptotically unbiased as $n\to\infty$.


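Remark 8.5.30. If we replace the estimator in Example 8.5.29 by

$$\hat\lambda(t):=\frac{n-1}{\sum_{j=1}^n t_j}=\frac{1}{\frac{1}{n-1}\sum_{j=1}^n t_j}, \qquad t=(t_1,\ldots,t_n),$$

then the previous calculations imply

$$E_\lambda\hat\lambda=\frac{n-1}{n}\cdot\frac{n}{n-1}\,\lambda=\lambda.$$

Hence, from this small change we get an unbiased estimator $\hat\lambda$ for $\lambda$.
Observe that the calculations in Example 8.5.29 were only valid for $n\ge2$. If $n=1$, then the expected value of $\hat\lambda$ does not exist.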
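
The effect of replacing $n$ by $n-1$ can also be observed in a simulation. The following is a minimal sketch under assumed values ($\lambda = 2$, $n = 5$, a fixed seed); it compares the averages of both estimators over many simulated samples.

```python
import random

random.seed(1)                        # fixed seed so the run is reproducible
lam, n, runs = 2.0, 5, 40000          # hypothetical rate and sample size

sum_plain = sum_corrected = 0.0
for _ in range(runs):
    total = sum(random.expovariate(lam) for _ in range(n))
    sum_plain += n / total            # estimator of Example 8.5.29
    sum_corrected += (n - 1) / total  # estimator of Remark 8.5.30

avg_plain = sum_plain / runs          # theory: lam * n / (n - 1) = 2.5
avg_corrected = sum_corrected / runs  # theory: lam = 2.0
print(avg_plain, avg_corrected)
```

The first average should lie near $\lambda n/(n-1)$, the second near $\lambda$ itself.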

8.5.3 Risk Function


Let $(\mathcal{X},\mathcal{F},P_\theta)_{\theta\in\Theta}$ be a parametric statistical model. Furthermore, $\gamma:\Theta\to\mathbb{R}$ is a function of the parameter and $\hat\gamma:\mathcal{X}\to\mathbb{R}$ is an estimator for $\gamma$. Suppose $\theta\in\Theta$ is the true parameter and we observe some $x\in\mathcal{X}$. Then, in general, we will have $\gamma(\theta)\ne\hat\gamma(x)$, and the quadratic error $|\gamma(\theta)-\hat\gamma(x)|^2$ occurs. Other ways to measure the error are possible and useful, but we restrict ourselves to the quadratic distance. In this way we get the so-called loss function $L:\Theta\times\mathcal{X}\to\mathbb{R}$ of $\hat\gamma$ defined by

$$L(\theta,x):=|\gamma(\theta)-\hat\gamma(x)|^2.$$

In other words, if $\theta$ is the correct parameter and our sample is $x\in\mathcal{X}$, then, using $\hat\gamma$ as the estimator, the (quadratic) error or loss will be $L(\theta,x)$. On average, the (quadratic) loss is evaluated by $E_\theta|\gamma(\theta)-\hat\gamma|^2$.


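Definition 8.5.31. The function $R$ describing this average loss of $\hat\gamma$ is said to be the risk function of the estimator $\hat\gamma$. It is defined by

$$R(\theta,\hat\gamma):=E_\theta|\gamma(\theta)-\hat\gamma|^2, \qquad \theta\in\Theta.$$


Before giving some examples of risk functions, let us rewrite $R$ as follows.


Proposition 8.5.32. If $\theta\in\Theta$, then it follows that

$$R(\theta,\hat\gamma)=|\gamma(\theta)-E_\theta\hat\gamma|^2+V_\theta\hat\gamma. \tag{8.35}$$


Proof: The assertion is a consequence of

$$\begin{aligned}
R(\theta,\hat\gamma)&=E_\theta\big[\gamma(\theta)-\hat\gamma\big]^2=E_\theta\big[(\gamma(\theta)-E_\theta\hat\gamma)+(E_\theta\hat\gamma-\hat\gamma)\big]^2\\
&=|\gamma(\theta)-E_\theta\hat\gamma|^2+2\,(\gamma(\theta)-E_\theta\hat\gamma)\,E_\theta(E_\theta\hat\gamma-\hat\gamma)+V_\theta\hat\gamma.
\end{aligned}$$

Because of

$$E_\theta(E_\theta\hat\gamma-\hat\gamma)=E_\theta\hat\gamma-E_\theta\hat\gamma=0,$$

this implies eq. (8.35). □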


Definition 8.5.33. The function $\theta\mapsto|\gamma(\theta)-E_\theta\hat\gamma|^2$, which appears in eq. (8.35), is said to be the bias or the systematic error of the estimator $\hat\gamma$.




Corollary 8.5.34. A point estimator $\hat\gamma$ is unbiased if and only if for all $\theta\in\Theta$ its bias is zero. Moreover, if this is so, then its risk function is given by

$$R(\theta,\hat\gamma)=V_\theta\hat\gamma, \qquad \theta\in\Theta.$$


Remark 8.5.35. Another way to formulate eq. (8.35) is as follows. The risk function of an estimator consists of two parts. One part is the systematic error, which does not occur for unbiased estimators. And the second part is given by $V_\theta\hat\gamma$. Thus, the smaller the bias and/or $V_\theta\hat\gamma$ become, the smaller is the risk of getting a wrong estimation for $\gamma(\theta)$, and the better is the estimator.


Example 8.5.36. Let us determine the risk functions for the two estimators presented in Example 8.5.27. The estimator for $\mu$ was given by $\hat\gamma(x)=\bar x$. Since this is an unbiased estimator, by Corollary 8.5.34, its risk function is computed as

$$R((\mu,\sigma^2),\hat\gamma)=V_{\mu,\sigma^2}\hat\gamma.$$

The random variable $x\mapsto\bar x$ is $\mathcal{N}(\mu,\sigma^2/n)$-distributed, hence

$$R((\mu,\sigma^2),\hat\gamma)=\frac{\sigma^2}{n}.$$

There are two interesting facts about this risk function. First, it does not depend on the parameter $\mu$ that we want to estimate. And secondly, if $n\to\infty$, then the risk tends to zero. In other words, the bigger the sample size, the smaller becomes the risk of a wrong estimation.

Next we evaluate the risk function of the estimator $\hat\gamma(x)=s_x^2$. As we saw in Example 8.5.27, this $\hat\gamma$ is also an unbiased estimator for $\sigma^2$, hence

$$R((\mu,\sigma^2),\hat\gamma)=V_{\mu,\sigma^2}\hat\gamma.$$

By eq. (8.10) we know that $\frac{n-1}{\sigma^2}s_x^2$ is $\chi^2_{n-1}$-distributed, hence Corollary 5.2.26 implies

$$V_{\mu,\sigma^2}\Big[\frac{n-1}{\sigma^2}s_x^2\Big]=2(n-1).$$

From this one easily derives

$$R((\mu,\sigma^2),\hat\gamma)=V_{\mu,\sigma^2}s_x^2=2(n-1)\cdot\frac{\sigma^4}{(n-1)^2}=\frac{2\sigma^4}{n-1}.$$

Here, the risk function depends heavily on the parameter $\sigma^2$ that we want to estimate. Furthermore, if $n\to\infty$, then also in this case the risk tends to zero.


Example 8.5.37. Finally, regard the statistical model $(\mathcal{X},\mathcal{P}(\mathcal{X}),B_{n,\theta})_{0\le\theta\le1}$, where $\mathcal{X}=\{0,\ldots,n\}$. In order to estimate $\theta\in[0,1]$, we take, as in Example 8.5.25, the estimator $\hat\theta(k)=\frac kn$. There it was shown that the estimator is unbiased, hence, by Corollary 8.5.34, it follows that

$$R(\theta,\hat\theta)=V_\theta\hat\theta, \qquad 0\le\theta\le1.$$

If $X$ is the identity on $\mathcal{X}$, by Proposition 5.2.18, its variance equals $V_\theta X=n\theta(1-\theta)$. Since $\hat\theta=\frac Xn$ this implies

$$R(\theta,\hat\theta)=V_\theta(X/n)=\frac{V_\theta X}{n^2}=\frac{\theta(1-\theta)}{n}.$$

Consequently, the risk function becomes maximal for $\theta=1/2$, while for $\theta=0$ or $\theta=1$ it vanishes.
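
For the binomial model the risk can be computed exactly by summing over all possible outcomes, which gives a direct check of the formula $\theta(1-\theta)/n$. The sketch below uses exact fractions; the function name `risk` and the chosen values of $n$ and $\theta$ are assumptions made for illustration.

```python
from fractions import Fraction
from math import comb

def risk(theta, n):
    """Exact risk E_theta |theta - k/n|^2 when k is binomial(n, theta)."""
    return sum(
        comb(n, k) * theta**k * (1 - theta)**(n - k) * (theta - Fraction(k, n))**2
        for k in range(n + 1)
    )

n = 10                                    # hypothetical sample size
for theta in (Fraction(0), Fraction(1, 4), Fraction(1, 2), Fraction(1)):
    # Compare the summed risk with the closed form theta(1 - theta)/n.
    assert risk(theta, n) == theta * (1 - theta) / n
    print(theta, risk(theta, n))
```

The printed values show the behaviour described in the example: the risk is largest at $\theta = 1/2$ and vanishes at the endpoints.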


We saw in Corollary 8.5.34 that $R(\theta,\hat\gamma)=V_\theta\hat\gamma$ for unbiased $\hat\gamma$. Thus, for such estimators inequality (7.2) implies

$$P_\theta\{x\in\mathcal{X}:|\gamma(\theta)-\hat\gamma(x)|\ge c\}\le\frac{V_\theta\hat\gamma}{c^2},$$

that is, the smaller $V_\theta\hat\gamma$ is, the greater is the chance to estimate a value near the correct one. This observation leads to the following definition.


Definition 8.5.38. Let $\hat\gamma_1$ and $\hat\gamma_2$ be two unbiased estimators for $\gamma(\theta)$. Then $\hat\gamma_1$ is said to be uniformly better than $\hat\gamma_2$ provided that

$$V_\theta\hat\gamma_1\le V_\theta\hat\gamma_2 \quad\text{for all } \theta\in\Theta.$$

An unbiased estimator $\hat\gamma_*$ is called the uniformly best estimator if it is uniformly better than all other unbiased estimators for $\gamma(\theta)$.


Example 8.5.39. We observe values that, for some $b>0$, are uniformly distributed on $[0,b]$. But the number $b>0$ is unknown. In order to estimate it, one executes $n$ independent trials and obtains as sample $x=(x_1,\ldots,x_n)$. As point estimators for $b>0$ one may either choose

$$\hat b_1(x):=\frac{n+1}{n}\max_{1\le i\le n}x_i \quad\text{or}\quad \hat b_2(x):=\frac2n\sum_{i=1}^n x_i.$$

According to Problem 8.4, the estimators $\hat b_1$ and $\hat b_2$ are both unbiased. Furthermore, not too difficult calculations show that

$$V_b\hat b_1=\frac{b^2}{n(n+2)} \quad\text{and}\quad V_b\hat b_2=\frac{b^2}{3n}.$$

Therefore, $V_b\hat b_1\le V_b\hat b_2$ for all $b>0$. This tells us that $\hat b_1$ is uniformly better than $\hat b_2$.
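
To get a feeling for how much better $\hat b_1$ is, one can simulate both estimators and compare their empirical variances with the two formulas. This is only a sketch under assumed values ($b = 3$, $n = 10$, a fixed seed); the list names are hypothetical.

```python
import random
from statistics import mean, pvariance

random.seed(2)                         # fixed seed so the run is reproducible
b, n, runs = 3.0, 10, 20000            # hypothetical endpoint and sample size

est1, est2 = [], []
for _ in range(runs):
    xs = [random.uniform(0, b) for _ in range(n)]
    est1.append((n + 1) / n * max(xs))    # estimator b_1
    est2.append(2 / n * sum(xs))          # estimator b_2

print(mean(est1), mean(est2))                   # both should be close to b = 3
print(pvariance(est1), b**2 / (n * (n + 2)))    # theory: 0.075
print(pvariance(est2), b**2 / (3 * n))          # theory: 0.3
```

For $n = 10$ the variance of $\hat b_2$ is four times that of $\hat b_1$, and the ratio $(n+2)/3$ grows with $n$.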


Remark 8.5.40. A very natural question is whether there exists a lower bound for the precision of an estimator. In other words, are there estimators for which the risk function becomes arbitrarily small? The answer depends heavily on the inherent information in the statistical model. To explain this let us come back once more to Example 8.5.4.

Suppose we had $P_0(\{a\})=1$ and $P_1(\{b\})=1$. Then the occurrence of “a” would tell us with 100% certainty that $\theta=0$ is the correct parameter. The risk for the corresponding estimator is then zero. On the contrary, if $P_0(\{a\})=P_0(\{b\})=1/2$ and the same holds for $P_1$, then the occurrence of “a” tells us nothing about the correct parameter.


To make the previous observation more precise, we have to introduce some quantity that measures the information contained in a statistical model.


Definition 8.5.41. Let $(\mathcal{X},\mathcal{F},P_\theta)_{\theta\in\Theta}$ be a statistical model with log-likelihood function $L$ introduced in Definition 8.5.9. For simplicity, assume $\Theta\subseteq\mathbb{R}$. Then the function $I:\Theta\to\mathbb{R}$ defined by

$$I(\theta):=E_\theta\Big(\frac{\partial L}{\partial\theta}\Big)^2$$

is called the Fisher information of the model. Of course, we have to suppose that the derivatives and the expected value exist.


Example 8.5.42. Let us investigate the Fisher information for the model treated in Example 8.5.13. There we had

$$L(\mu,\sigma^2,x)=L(\mu,\sigma^2,x_1,\ldots,x_n)=c-\frac n2\ln\sigma^2-\frac{1}{2\sigma^2}\sum_{j=1}^n(x_j-\mu)^2.$$

Fix $\sigma^2$ and take the derivative with respect to $\mu$. This leads to

$$\frac{\partial L}{\partial\mu}=\frac{n\bar x-n\mu}{\sigma^2},$$

hence

$$\Big(\frac{\partial L}{\partial\mu}\Big)^2=\frac{n^2\,|\bar x-\mu|^2}{\sigma^4}.$$

Recall that $\bar x$ is $\mathcal{N}(\mu,\sigma^2/n)$-distributed, hence the expected value of $|\bar x-\mu|^2$ is nothing else than the variance of $\bar x$, that is, it is $\sigma^2/n$. Consequently,

$$I(\mu)=E_{\mu,\sigma^2}\Big(\frac{\partial L}{\partial\mu}\Big)^2=\frac{n^2}{\sigma^4}\cdot\frac{\sigma^2}{n}=\frac{n}{\sigma^2}.$$

The following result answers the above question: how precise can an estimator become at the most?


Proposition 8.5.43 (Rao-Cramér-Fréchet). Let $(\mathcal{X},\mathcal{F},P_\theta)_{\theta\in\Theta}$ be a parametric model for which the Fisher information $I:\Theta\to\mathbb{R}$ exists. If $\hat\theta$ is an unbiased estimator for $\theta$, then

$$V_\theta\hat\theta\ge\frac{1}{I(\theta)}, \qquad \theta\in\Theta. \tag{8.36}$$


Remark 8.5.44. Estimators $\hat\theta$ that attain the lower bound in estimate (8.36) are said to be efficient. That is, for those estimators $V_\theta\hat\theta=1/I(\theta)$ holds for all $\theta\in\Theta$. In other words, efficient estimators possess the best possible accuracy.

In view of Examples 8.5.36 and 8.5.42, for normally distributed populations the estimator $\hat\mu(x)=\bar x$ is an efficient estimator for $\mu$. Other efficient estimators are those investigated in Examples 8.5.26 and 8.5.12. On the other hand, the estimator for $\sigma^2$ in Example 8.5.27 is not efficient. But it can be shown that $s_x^2$ is a uniformly best estimator for $\sigma^2$, that is, there do not exist efficient estimators in this case.
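
The efficiency of the estimator $\hat\theta(k) = k/n$ from Example 8.5.12 can be verified by an exact computation in the binomial model. The sketch below derives $\partial L/\partial\theta = k/\theta - (n-k)/(1-\theta)$ from eq. (8.30) and sums over all outcomes; the function names and the values $n = 12$, $\theta = 1/3$ are assumptions chosen for illustration.

```python
from fractions import Fraction
from math import comb

def pmf(theta, k, n):
    """Binomial likelihood (8.25)."""
    return comb(n, k) * theta**k * (1 - theta)**(n - k)

def fisher_information(theta, n):
    """E_theta (dL/dtheta)^2 with dL/dtheta = k/theta - (n-k)/(1-theta)."""
    return sum(pmf(theta, k, n) * (Fraction(k) / theta - Fraction(n - k) / (1 - theta))**2
               for k in range(n + 1))

def variance_of_estimator(theta, n):
    """Variance of theta_hat(k) = k/n, which has mean theta."""
    return sum(pmf(theta, k, n) * (Fraction(k, n) - theta)**2 for k in range(n + 1))

n, theta = 12, Fraction(1, 3)             # hypothetical values with 0 < theta < 1
print(fisher_information(theta, n))       # 54
print(variance_of_estimator(theta, n) * fisher_information(theta, n))   # 1
```

The product of variance and Fisher information equals 1, so the lower bound in (8.36) is attained for these values.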


8.6 Confidence Regions and Intervals 


Point estimations provide us with a single value $\theta\in\Theta$. Further work or necessary decisions are then based on this estimated parameter. The disadvantage of this approach is that we have no knowledge about the precision of the obtained value. Is the estimated parameter far away from the true one or maybe very near? To explain the problem, let us come back to the situation described in Example 8.5.16. If the retailer observes 4 defective TV sets among 15 tested, then he estimates that there are 26 defective sets in the delivery of 100. But he does not know how precise his estimation of 26 is. Maybe there are many more defective sets in the delivery, or maybe fewer than 26. The only information he has is that the estimates are correct on average. But this does not say anything about the accuracy of a single estimate.

This disadvantage of point estimators is avoided when estimating a certain set of parameters, not only a single point. Then the true parameter is contained with great probability in this randomly chosen region. In most cases, these regions will be intervals of real or natural numbers.


Definition 8.6.1. Suppose we are given a parametric statistical model $(\mathcal{X},\mathcal{F},P_\theta)_{\theta\in\Theta}$. A mapping $C:\mathcal{X}\to\mathcal{P}(\Theta)$ is called an interval estimator,¹⁰ provided for fixed $\theta\in\Theta$

$$\{x\in\mathcal{X}:\theta\in C(x)\}\in\mathcal{F}. \tag{8.37}$$


Remark 8.6.2. Condition (8.37) is quite technical and will play no role later on. But it is necessary because otherwise the next definition does not make sense.


Definition 8.6.3. Let $\alpha$ be a real number in $(0,1)$. Suppose an interval estimator $C:\mathcal{X}\to\mathcal{P}(\Theta)$ satisfies for each $\theta\in\Theta$ the condition

$$P_\theta\{x\in\mathcal{X}:\theta\in C(x)\}\ge1-\alpha. \tag{8.38}$$

Then $C$ is said to be a $100(1-\alpha)\%$ interval estimator.¹¹ The sets $C(x)\subseteq\Theta$ with $x\in\mathcal{X}$ are called $100(1-\alpha)\%$ confidence regions or confidence intervals.¹²


How does an interval estimator apply? Suppose $\theta\in\Theta$ is the “true” parameter. In a statistical experiment, one obtains some sample $x\in\mathcal{X}$ distributed according to $P_\theta$. Depending on the observed sample $x$, we choose a set $C(x)$ of parameters. Then with probability greater than or equal to $1-\alpha$, the observed $x\in\mathcal{X}$ leads to a region $C(x)$ that contains the true parameter $\theta$.


Remark 8.6.4. It is important to say that the region $C(x)$ is random, not the unknown parameter $\theta\in\Theta$. Metaphorically speaking, a fish (the true parameter $\theta$) is in a pond at some fixed but unknown spot. We execute a certain statistical experiment to get some information about the place where the fish is situated. Depending on the result of the experiment, we throw a net into the pond. Doing so, we know that with probability greater than or equal to $1-\alpha$, the result of the experiment leads to a net that catches the fish. In other words, the position of the fish is not random; what is random is the observed sample, hence also the thrown net.


10 Better notation would be “region estimator” because $C(x) \subseteq \Theta$ may be an arbitrary subset, not
necessarily an interval, but “interval estimator” is commonly accepted; therefore, we use it here also.
11 Also $1 - \alpha$ estimator.

12 Also $1 - \alpha$ confidence regions or intervals.

Remark 8.6.5. It is quite self-evident that one should try to choose the confidence regions
as small as possible, without violating condition (8.38). If we are not interested
in “small” confidence regions, then we could always choose $C(x) = \Theta$. This is not
forbidden, but completely useless.


Construction of confidence regions: For a better understanding of the subsequent construction,
let us briefly recall the main assertions about hypothesis tests from a
slightly different point of view.

Let $(\mathcal{X}, \mathcal{F}, \mathbb{P}_\vartheta)_{\vartheta \in \Theta}$ be a statistical model. We choose a fixed, but arbitrary, $\theta \in \Theta$.
With this chosen $\theta$, we formulate the null hypothesis as $\mathbb{H}_0 : \vartheta = \theta$. The alternative
hypothesis is then $\mathbb{H}_1 : \vartheta \ne \theta$. Let $T = (\mathcal{X}_0, \mathcal{X}_1)$ be an $\alpha$-significance test for $\mathbb{H}_0$ against
$\mathbb{H}_1$. Because the hypothesis, hence also the test, depends on the chosen $\theta \in \Theta$, we
denote the null hypothesis by $\mathbb{H}_0(\theta)$ and write $T(\theta) = (\mathcal{X}_0(\theta), \mathcal{X}_1(\theta))$ for the test. That
is, $\mathbb{H}_0(\theta) : \vartheta = \theta$ and $T(\theta)$ is an $\alpha$-significance test for $\mathbb{H}_0(\theta)$. With this notation set

$$C(x) := \{\theta \in \Theta : x \in \mathcal{X}_0(\theta)\}. \tag{8.39}$$


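Example 8.6.6. Choose the hypothesis and the test as in Proposition 8.4.12. The statistical
model is then given by $(\mathbb{R}^n, \mathcal{B}(\mathbb{R}^n), \mathcal{N}(\nu, \sigma_0^2)^{\otimes n})_{\nu \in \mathbb{R}}$, where this time we denote
the unknown expected value by $\nu$. For some fixed, but arbitrary, $\mu \in \mathbb{R}$ let

$$\mathbb{H}_0(\mu) : \nu = \mu \quad\text{and}\quad \mathbb{H}_1(\mu) : \nu \ne \mu.$$

The $\alpha$-significance test $T(\mu)$ constructed in Proposition 8.4.12 possesses the region of
acceptance

$$\mathcal{X}_0(\mu) = \Big\{x \in \mathbb{R}^n : \sqrt{n}\,\Big|\frac{\bar{x} - \mu}{\sigma_0}\Big| \le z_{1-\alpha/2}\Big\}.$$

Thus, in this case, the set $C(x)$ in eq. (8.39) consists of those $\mu \in \mathbb{R}$ that satisfy the
estimate $\sqrt{n}\,\big|\frac{\bar{x} - \mu}{\sigma_0}\big| \le z_{1-\alpha/2}$. That is, given $x \in \mathbb{R}^n$, then $C(x)$ is the interval

$$C(x) = \Big[\bar{x} - \frac{\sigma_0}{\sqrt{n}}\, z_{1-\alpha/2}\,,\ \bar{x} + \frac{\sigma_0}{\sqrt{n}}\, z_{1-\alpha/2}\Big].$$
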
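Let us come back to the general situation. The statistical model is $(\mathcal{X}, \mathcal{F}, \mathbb{P}_\vartheta)_{\vartheta \in \Theta}$. Given
$\theta \in \Theta$ let $T(\theta)$ be an $\alpha$-significance test for $\mathbb{H}_0(\theta)$ against $\mathbb{H}_1(\theta)$ where $\mathbb{H}_0(\theta)$ is the
hypothesis $\mathbb{H}_0(\theta) : \vartheta = \theta$. Given $x \in \mathcal{X}$ define $C(x) \subseteq \Theta$ by eq. (8.39). Then the following is
valid.


Proposition 8.6.7. Let $T(\theta)$ be as above an $\alpha$-significance test for $\mathbb{H}_0(\theta)$ against $\mathbb{H}_1(\theta)$.
Define $C(x)$ by eq. (8.39) where $\mathcal{X}_0(\theta)$ denotes the region of acceptance of $T(\theta)$. Then
the mapping $x \mapsto C(x)$ from $\mathcal{X}$ into $\mathcal{P}(\Theta)$ is a $100(1 - \alpha)\%$ interval estimator. Hence,
$\{C(x) : x \in \mathcal{X}\}$ is a collection of $100(1 - \alpha)\%$ confidence regions.
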
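Proof: By assumption, $T(\theta)$ is an $\alpha$-significance test for $\mathbb{H}_0(\theta)$. The definition of those
tests tells us that

$$\mathbb{P}_\theta(\mathcal{X}_1(\theta)) \le \alpha, \quad\text{hence}\quad \mathbb{P}_\theta(\mathcal{X}_0(\theta)) \ge 1 - \alpha.$$

Given $\theta \in \Theta$ and $x \in \mathcal{X}$, by the construction of $C(x)$, one has $\theta \in C(x)$ if and only if
$x \in \mathcal{X}_0(\theta)$. Combining these two observations, given $\theta \in \Theta$, it follows that

$$\mathbb{P}_\theta\{x \in \mathcal{X} : \theta \in C(x)\} = \mathbb{P}_\theta\{x \in \mathcal{X} : x \in \mathcal{X}_0(\theta)\} = \mathbb{P}_\theta(\mathcal{X}_0(\theta)) \ge 1 - \alpha.$$

This completes the proof. □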


Example 8.6.8. Proposition 8.6.7 implies that the intervals $C(x)$ in Example 8.6.6 are
$1 - \alpha$ confidence intervals. That is, if we execute an experiment or if we analyze
some data, then with probability greater than or equal to $1 - \alpha$ we will observe values
$x = (x_1, \ldots, x_n)$ such that the “true” parameter $\mu$ satisfies

$$\bar{x} - \frac{\sigma_0}{\sqrt{n}}\, z_{1-\alpha/2} \le \mu \le \bar{x} + \frac{\sigma_0}{\sqrt{n}}\, z_{1-\alpha/2}.$$

For example, if we choose $\alpha = 0.05$ and observe the nine values

10.1, 9.2, 10.2, 10.3, 10.1, 9.9, 10.0, 9.7, 9.8,

then $\bar{x} = 9.9222$. The variance $\sigma_0^2$ is not known, therefore we use its estimation by $s_x^2$,
that is, we take $\sigma_0$ as $s_x = 0.330824$. Because of $z_{1-\alpha/2} = z_{0.975} = 1.95996$, with a confidence
of 95% we finally get

$$9.7061 \le \mu \le 10.1384.$$
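
The arithmetic of this example can be retraced with a few lines of Python. This is only a sketch under the assumptions stated in the example: the unknown standard deviation is replaced by the sample standard deviation $s_x$ (computed with the factor $1/(n-1)$), and the function name `z_interval` is illustrative, not notation from the book.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def z_interval(sample, alpha, sigma0=None):
    """Two-sided interval  x_bar -/+ (sigma0 / sqrt(n)) * z_{1 - alpha/2}.

    If sigma0 is None, the sample standard deviation s_x is plugged in,
    which is the substitution made in Example 8.6.8.
    """
    n = len(sample)
    x_bar = mean(sample)
    sigma = stdev(sample) if sigma0 is None else sigma0
    z = NormalDist().inv_cdf(1 - alpha / 2)  # quantile z_{1 - alpha/2}
    half_width = sigma / sqrt(n) * z
    return x_bar - half_width, x_bar + half_width


values = [10.1, 9.2, 10.2, 10.3, 10.1, 9.9, 10.0, 9.7, 9.8]
lower, upper = z_interval(values, alpha=0.05)
print(f"{lower:.4f} <= mu <= {upper:.4f}")  # 9.7061 <= mu <= 10.1384
```

Passing a known `sigma0` instead gives the interval of Example 8.6.6 without the plug-in step.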


In the next example we describe the confidence intervals generated by the t-test 
treated in Proposition 8.4.12. 


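Example 8.6.9. The statistical model is

$$\big(\mathbb{R}^n, \mathcal{B}(\mathbb{R}^n), \mathcal{N}(\nu, \sigma^2)^{\otimes n}\big)_{(\nu, \sigma^2) \in \mathbb{R} \times (0, \infty)}.$$

By Proposition 8.4.12 an $\alpha$-significance test $T(\mu)$ is given by the region of acceptance

$$\mathcal{X}_0(\mu) = \Big\{x \in \mathbb{R}^n : \sqrt{n}\,\Big|\frac{\bar{x} - \mu}{s_x}\Big| \le t_{n-1;1-\alpha/2}\Big\}.$$

From this one easily derives

$$C(x) = \Big\{\mu \in \mathbb{R} : \sqrt{n}\,\Big|\frac{\bar{x} - \mu}{s_x}\Big| \le t_{n-1;1-\alpha/2}\Big\}
= \Big[\bar{x} - \frac{s_x}{\sqrt{n}}\, t_{n-1;1-\alpha/2}\,,\ \bar{x} + \frac{s_x}{\sqrt{n}}\, t_{n-1;1-\alpha/2}\Big].$$
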
Let us explain the result by the concrete sample investigated in Example 8.4.15. There
we had $\bar{x} = 22.072$, $s_x = 0.07554248$, and $n = 10$. For $\alpha = 0.05$, the quantile of $t_9$ equals
$t_{9;0.975} = 2.26$. Hence the half-width is $2.26 \cdot 0.07554248/\sqrt{10} \approx 0.054$, and we derive
$[22.018, 22.126]$ as 95% confidence interval.

Verbally this says that with a confidence of 95% we observed those $x_1, \ldots, x_{10}$ for which
$\mu \in C(x) = [22.018, 22.126]$.


The next example shows how Proposition 8.6.7 applies in the case of discrete probab- 
ility measures. 


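Example 8.6.10. The statistical model is $(\mathcal{X}, \mathcal{P}(\mathcal{X}), B_{n,\theta})_{0 \le \theta \le 1}$ where the sample space
is $\mathcal{X} = \{0, \ldots, n\}$. Our aim is to construct confidence regions $C(k) \subseteq [0, 1]$, $k = 0, \ldots, n$,
such that

$$B_{n,\theta}\{k \le n : \theta \in C(k)\} \ge 1 - \alpha.$$

In order to get these confidence regions, we use Proposition 8.6.7. As shown in Proposition
8.3.1, the region of acceptance $\mathcal{X}_0(\theta)$ of an $\alpha$-significance test $T(\theta)$, where
$\mathbb{H}_0 : \vartheta = \theta$, is given by

$$\mathcal{X}_0(\theta) = \{n_0(\theta), \ldots, n_1(\theta)\}.$$

Here, the numbers $n_0(\theta)$ and $n_1(\theta)$ were defined by

$$n_0(\theta) := \min\Big\{k \le n : \sum_{j=0}^{k} \binom{n}{j} \theta^j (1 - \theta)^{n-j} > \alpha/2\Big\}$$

and

$$n_1(\theta) := \max\Big\{k \le n : \sum_{j=k}^{n} \binom{n}{j} \theta^j (1 - \theta)^{n-j} > \alpha/2\Big\}.$$

Applying Proposition 8.6.7, the sets

$$C(k) := \{\theta \in [0, 1] : k \in \mathcal{X}_0(\theta)\} = \{\theta \in [0, 1] : n_0(\theta) \le k \le n_1(\theta)\}, \quad k = 0, \ldots, n,$$

are $1 - \alpha$ confidence regions. By the definition of $n_0(\theta)$ and of $n_1(\theta)$, given $k \le n$, a
number $\theta \in [0, 1]$ satisfies $n_0(\theta) \le k \le n_1(\theta)$ if and only if at the same time

$$B_{n,\theta}(\{0, \ldots, k\}) = \sum_{j=0}^{k} \binom{n}{j} \theta^j (1 - \theta)^{n-j} > \alpha/2 \quad\text{and}$$

$$B_{n,\theta}(\{k, \ldots, n\}) = \sum_{j=k}^{n} \binom{n}{j} \theta^j (1 - \theta)^{n-j} > \alpha/2.$$

In other words, observing $k \le n$, the corresponding $1 - \alpha$ confidence region is
given by

$$C(k) = \{\theta : B_{n,\theta}(\{0, \ldots, k\}) > \alpha/2\} \cap \{\theta : B_{n,\theta}(\{k, \ldots, n\}) > \alpha/2\}. \tag{8.40}$$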


These sets are called the $100(1 - \alpha)\%$ Clopper–Pearson intervals or also exact
confidence intervals for the binomial distribution.
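
The coverage guarantee of Proposition 8.6.7 can be checked numerically for a small binomial model. The following sketch computes $n_0(\theta)$ and $n_1(\theta)$ as defined in this example and evaluates $B_{n,\theta}(\mathcal{X}_0(\theta))$, which equals $B_{n,\theta}\{k : \theta \in C(k)\}$, on a grid of parameters. The choices $n = 20$, $\alpha = 0.1$, the grid, and all function names are illustrative assumptions.

```python
from math import comb


def pmf(n, theta):
    """Binomial probabilities B_{n,theta}({k}) for k = 0, ..., n."""
    return [comb(n, k) * theta**k * (1 - theta) ** (n - k) for k in range(n + 1)]


def acceptance_region(n, theta, alpha):
    """Return (n0, n1) with X_0(theta) = {n0, ..., n1} as in Example 8.6.10."""
    p = pmf(n, theta)
    n0 = min(k for k in range(n + 1) if sum(p[: k + 1]) > alpha / 2)
    n1 = max(k for k in range(n + 1) if sum(p[k:]) > alpha / 2)
    return n0, n1


def coverage(n, theta, alpha):
    """B_{n,theta}{k : theta in C(k)} = B_{n,theta}(X_0(theta))."""
    n0, n1 = acceptance_region(n, theta, alpha)
    return sum(pmf(n, theta)[n0 : n1 + 1])


n, alpha = 20, 0.1
grid = [i / 200 for i in range(201)]
worst = min(coverage(n, theta, alpha) for theta in grid)
print(f"smallest coverage on the grid: {worst:.4f}")  # never below 1 - alpha
```

Because the binomial distribution is discrete, the coverage is usually strictly larger than $1 - \alpha$; the proposition only guarantees the lower bound.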

Let us consider the following concrete example. In an urn are white and black
balls with an unknown proportion $\theta$ of white balls. In order to get some information
about $\theta$, we choose randomly 500 balls with replacement. Say, 220 of the chosen balls
are white. What is the 90% confidence interval for $\theta$ based on this observation?

Answer: We have $n = 500$ and the observed $k$ equals 220. Consequently, the
confidence interval $C(220)$ consists of those $\theta \in [0, 1]$ for which at the same time

$$f(\theta) > \alpha/2 = 0.05 \quad\text{and}\quad g(\theta) > \alpha/2 = 0.05,$$

where

$$f(\theta) := \sum_{j=0}^{220} \binom{500}{j} \theta^j (1 - \theta)^{500-j} \quad\text{and}\quad g(\theta) := \sum_{j=220}^{500} \binom{500}{j} \theta^j (1 - \theta)^{500-j}.$$

Numerical calculations tell us that

$$f(0.4777) = 0.0500352 \approx 0.05 \quad\text{as well as}\quad g(0.4028) = 0.0498975 \approx 0.05.$$

Therefore, a 90% confidence interval $C(220)$ is given by

$$C(220) = (0.4028, 0.4777).$$

For $n = 1000$ and 440 observed white balls similar calculations lead to the narrower,
hence more informative, interval

$$C(440) = (0.4139, 0.4664).$$
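
The “numerical calculations” mentioned above can be carried out, for instance, by bisection, since $g(\theta) = B_{n,\theta}(\{k, \ldots, n\})$ is increasing and $f(\theta) = B_{n,\theta}(\{0, \ldots, k\})$ is decreasing in $\theta$. The sketch below is one possible way to do this, not necessarily the method used for the numbers in the text; it assumes $0 < k < n$, evaluates the binomial probabilities through logarithms to stay safe for large $n$, and uses illustrative function names.

```python
from math import exp, lgamma, log


def binom_pmf(n, j, theta):
    """B_{n,theta}({j}) for 0 < theta < 1, computed via logarithms."""
    log_p = (lgamma(n + 1) - lgamma(j + 1) - lgamma(n - j + 1)
             + j * log(theta) + (n - j) * log(1 - theta))
    return exp(log_p)


def lower_tail(n, k, theta):
    """f(theta) = B_{n,theta}({0, ..., k}), decreasing in theta."""
    return sum(binom_pmf(n, j, theta) for j in range(k + 1))


def upper_tail(n, k, theta):
    """g(theta) = B_{n,theta}({k, ..., n}), increasing in theta."""
    return sum(binom_pmf(n, j, theta) for j in range(k, n + 1))


def solve(func, target, increasing, tol=1e-10):
    """Bisection for func(theta) = target on (0, 1), func monotone."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if (func(mid) < target) == increasing:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2


def clopper_pearson(n, k, alpha):
    """Endpoints of C(k) in eq. (8.40); assumes 0 < k < n."""
    left = solve(lambda t: upper_tail(n, k, t), alpha / 2, increasing=True)
    right = solve(lambda t: lower_tail(n, k, t), alpha / 2, increasing=False)
    return left, right


left, right = clopper_pearson(500, 220, alpha=0.1)
print(f"C(220) = ({left:.4f}, {right:.4f})")
```

The boundary cases $k = 0$ and $k = n$ are excluded here because one of the two conditions in (8.40) then holds for every $\theta$.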
Remark 8.6.11. The previous example already indicates that the determination of
the Clopper–Pearson intervals becomes quite complicated for large $n$. Therefore, one
looks for “approximative” intervals. Background for the construction is the central
limit theorem in the form presented in Proposition 7.2.15. For $S_n$ distributed according
to $B_{n,\theta}$ it implies

$$\lim_{n \to \infty} \mathbb{P}\Big\{\Big|\frac{S_n - n\theta}{\sqrt{n\theta(1 - \theta)}}\Big| \le z_{1-\alpha/2}\Big\} = 1 - \alpha$$

or, equivalently,

$$\lim_{n \to \infty} B_{n,\theta}\Big\{k \le n : \Big|\frac{k - n\theta}{\sqrt{n\theta(1 - \theta)}}\Big| \le z_{1-\alpha/2}\Big\} = 1 - \alpha.$$

Here $z_{1-\alpha/2}$ are the quantiles introduced in Definition 8.4.5. Thus, an “approximative”
region of acceptance, testing the hypothesis “the unknown parameter is $\theta$,” is given
by

$$\mathcal{X}_0(\theta) = \Big\{k \le n : \Big|\frac{k}{n} - \theta\Big| \le z_{1-\alpha/2} \sqrt{\frac{\theta(1 - \theta)}{n}}\Big\}. \tag{8.41}$$

An application of Proposition 8.6.7 leads to certain confidence regions, but these are
not very useful. Due to the term $\sqrt{\theta(1 - \theta)}$ on the right-hand side of eq. (8.41), it is not
possible, for a given $k \le n$, to describe explicitly those $\theta$s for which $k \in \mathcal{X}_0(\theta)$. To
overcome this difficulty, we change $\mathcal{X}_0(\theta)$ another time by replacing $\theta$ on the right-hand
side by its MLE $\hat{\theta}(k) = \frac{k}{n}$. That is, we replace eq. (8.41) by

$$\tilde{\mathcal{X}}_0(\theta) = \Big\{k \le n : \Big|\frac{k}{n} - \theta\Big| \le z_{1-\alpha/2} \sqrt{\frac{\frac{k}{n}\big(1 - \frac{k}{n}\big)}{n}}\Big\}.$$

Doing so, an application of Proposition 8.6.7 leads to the “approximative” confidence
intervals $\tilde{C}(k)$, $k = 0, \ldots, n$, defined as

$$\tilde{C}(k) = \Big[\frac{k}{n} - z_{1-\alpha/2} \sqrt{\frac{\frac{k}{n}\big(1 - \frac{k}{n}\big)}{n}}\,,\ \frac{k}{n} + z_{1-\alpha/2} \sqrt{\frac{\frac{k}{n}\big(1 - \frac{k}{n}\big)}{n}}\Big]. \tag{8.42}$$


Example 8.6.12. We investigate once more Example 8.6.10. Among 500 chosen balls
we observed 220 white ones. This observation led to the “exact” 90% confidence
interval $C(220) = (0.4028, 0.4777)$.

Let us compare this result with the interval we get by using the approximative
approach. Since the quantile $z_{1-\alpha/2}$ for $\alpha = 0.1$ equals $z_{0.95} = 1.64485$, the left and the
right endpoints of the interval (8.42) with $k = 220$ are evaluated by

$$\frac{220}{500} - 1.64485 \cdot \sqrt{\frac{220 \cdot 280}{500^3}} = 0.4035 \quad\text{and}\quad \frac{220}{500} + 1.64485 \cdot \sqrt{\frac{220 \cdot 280}{500^3}} = 0.4765.$$

Thus, the “approximative” 90% confidence interval is $\tilde{C}(220) = (0.4035, 0.4765)$, which
does not differ very much from $C(220) = (0.4028, 0.4777)$.

In the case of 1000 trials and 440 white balls, the endpoints of a confidence
interval are evaluated by

$$\frac{440}{1000} - 1.64485 \cdot \sqrt{\frac{440 \cdot 560}{1000^3}} = 0.414181 \quad\text{and}\quad \frac{440}{1000} + 1.64485 \cdot \sqrt{\frac{440 \cdot 560}{1000^3}} = 0.465819.$$

That is, $\tilde{C}(440) = (0.4142, 0.4658)$ compared with $C(440) = (0.4139, 0.4664)$.
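
The approximative interval (8.42) is a closed formula, so it is easy to evaluate directly. The sketch below does this for the two cases of this example; the function name `approx_interval` is an illustrative choice, and the quantile is taken from Python's standard library rather than from a table.

```python
from math import sqrt
from statistics import NormalDist


def approx_interval(n, k, alpha):
    """Interval (8.42): k/n -/+ z_{1 - alpha/2} * sqrt((k/n)(1 - k/n)/n)."""
    p_hat = k / n  # MLE of theta
    z = NormalDist().inv_cdf(1 - alpha / 2)
    half_width = z * sqrt(p_hat * (1 - p_hat) / n)
    return p_hat - half_width, p_hat + half_width


for n, k in [(500, 220), (1000, 440)]:
    lo, hi = approx_interval(n, k, alpha=0.1)
    print(f"n={n}, k={k}: ({lo:.4f}, {hi:.4f})")
```

Doubling the sample size at the same observed proportion shrinks the half-width by the factor $1/\sqrt{2}$, which is visible in the two printed intervals.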


Example 8.6.13. A few days before an election 1000 randomly chosen people are
asked for whom they will vote next week, either candidate A or candidate B. Of
the interviewed people, 540 answered that they will vote for candidate A; the remaining
460 favor candidate B. Find a 90% confidence interval for the expected result of
candidate A in the election.

Solution: We have $n = 1000$, $k = 540$, and $\alpha = 0.1$. The quantile of level 0.95 of the
standard normal distribution equals $z_{0.95} = 1.64485$ (compare Example 8.6.12). Thus the
half-width in (8.42) is $1.64485 \cdot \sqrt{0.54 \cdot 0.46/1000} \approx 0.026$, which
leads to $[0.514, 0.566]$ as “approximative” 90% confidence interval for the expected
result of candidate A.

If one questions another 1000 randomly chosen people, another confidence interval
will occur. But, on average, in 9 of 10 cases a questioning of 1000 people will lead
to an interval containing the correct value.


8.7 Problems 


Problem 8.1. For some $b > 0$ let $\mathbb{P}_b$ be the uniform distribution on $[0, b]$. The precise
value of $b > 0$ is unknown. We claim that $b \le b_0$ for a certain $b_0 > 0$. Thus, the
hypotheses are

$$\mathbb{H}_0 : b \le b_0 \quad\text{and}\quad \mathbb{H}_1 : b > b_0.$$

To test $\mathbb{H}_0$, we choose randomly $n$ numbers $x_1, \ldots, x_n$ distributed according to $\mathbb{P}_b$.
Suppose the region of acceptance $\mathcal{X}_0$ of a hypothesis test $T_c$ is given by

$$\mathcal{X}_0 = \big\{(x_1, \ldots, x_n) : \max_{1 \le i \le n} x_i \le c\big\}$$

for some $c > 0$.

1. Determine those $c > 0$ for which $T_c$ is an $\alpha$-significance test of level $\alpha < 1$.

2. Suppose $T_c$ is an $\alpha$-significance test. For which of those $c > 0$ does the probability
for a type II error become minimal?

3. Determine the power function of the $\alpha$-test $T_c$ that minimizes the probability of the
occurrence of a type II error.

Problem 8.2. For $\theta > 0$ let $\mathbb{P}_\theta$ be the probability measure with density $p_\theta$ defined by

$$p_\theta(s) = \begin{cases} \theta s^{\theta - 1} & : s \in (0, 1) \\ 0 & : \text{otherwise} \end{cases}$$

1. Check whether the $p_\theta$s are probability density functions.
2. In order to get information about $\theta$ we execute $n$ independent trials according to
$\mathbb{P}_\theta$. Which statistical model describes this experiment?
3. Find the maximum likelihood estimator for $\theta$.


Problem 8.3. The lifetime of light bulbs is exponentially distributed with unknown
parameter $\lambda > 0$. In order to determine $\lambda$ we switch on $n$ light bulbs and record the
number of light bulbs that burn out until a certain time $T > 0$. Determine a statistical
model that describes this experiment. Find the MLE for $\lambda$.


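Problem 8.4. Consider the statistical model in Example 8.5.39, that is,
$(\mathbb{R}^n, \mathcal{B}(\mathbb{R}^n), \mathbb{P}_b^{\otimes n})_{b > 0}$ with uniform distribution $\mathbb{P}_b$ on $[0, b]$. There are two natural
estimators for $b > 0$, namely $\hat{b}_1$ and $\hat{b}_2$ defined by

$$\hat{b}_1(x) := \frac{n + 1}{n} \max_{1 \le i \le n} x_i \quad\text{and}\quad \hat{b}_2(x) := \frac{2}{n} \sum_{i=1}^{n} x_i, \qquad x = (x_1, \ldots, x_n) \in \mathbb{R}^n.$$

Prove that $\hat{b}_1$ and $\hat{b}_2$ possess the following properties.
1. The estimators $\hat{b}_1$ and $\hat{b}_2$ are unbiased.
2. One has

$$\mathbb{V}_b \hat{b}_1 = \frac{b^2}{n(n + 2)} \quad\text{and}\quad \mathbb{V}_b \hat{b}_2 = \frac{b^2}{3n}.$$
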
Problem 8.5. In a questioning of 2000 randomly chosen people 1420 answered that 
they regularly use the Internet. Find an “approximative” 90% confidence interval for 
the proportion of people using the Internet regularly. Determine the inequalities that
describe the exact intervals in eq. (8.40). 


Problem 8.6. What do the confidence intervals in eq. (8.40) look like for $k = 0$ or $k = n$?


Problem 8.7. Suppose the statistical model is $(\mathcal{X}, \mathcal{P}(\mathcal{X}), H_{N,M,n})_{M \le N}$ with
$\mathcal{X} = \{0, \ldots, n\}$ and with the hypergeometric distributions $H_{N,M,n}$ introduced in
Definition 1.4.25.

1. For some $M_0 \le N$ the hypotheses are $\mathbb{H}_0 : M = M_0$ against $\mathbb{H}_1 : M \ne M_0$. Find
(optimal) numbers $0 \le m_0 \le m_1 \le n$ such that $\mathcal{X}_0 = \{m_0, \ldots, m_1\}$ is the region of
acceptance of an $\alpha$-significance test $T$ for $\mathbb{H}_0$ against $\mathbb{H}_1$.

Hint: Modify the methods developed in Proposition 8.2.19 and compare the
construction of two-sided tests for a binomial distributed population.

2. Use Proposition 8.6.7 to derive from $\mathcal{X}_0$ confidence intervals $C(m)$, $0 \le m \le n$, of
level $\alpha$ for the unknown parameter $M$.

Hint: Follow the methods in Example 8.6.10 for the binomial distribution.


Problem 8.8. Use Proposition 8.6.7 to derive from Propositions 8.4.17 and 8.4.18 
confidence intervals for the unknown variance of a normal distributed population. 


A Appendix 


A.1 Notations 


Throughout the book we use the following standard notations: 

1. The natural numbers starting at 1 are always denoted by $\mathbb{N}$. In the case 0 is
included we write $\mathbb{N}_0$.

2. As usual the integers $\mathbb{Z}$ are given by $\mathbb{Z} = \{\ldots, -2, -1, 0, 1, 2, \ldots\}$.

3. By $\mathbb{R}$ we denote the field of real numbers endowed with the usual algebraic
operations and its natural order. The subset $\mathbb{Q} \subset \mathbb{R}$ is the set of all rational
numbers, that is, of numbers $m/n$ where $m, n \in \mathbb{Z}$ and $n \ne 0$.

4. Given $n \ge 1$ let $\mathbb{R}^n$ be the $n$-dimensional Euclidean vector space, that is,

$$\mathbb{R}^n = \{x = (x_1, \ldots, x_n) : x_j \in \mathbb{R}\}.$$

Addition and scalar multiplication in $\mathbb{R}^n$ are carried out coordinate-wise,

$$x + y = (x_1, \ldots, x_n) + (y_1, \ldots, y_n) = (x_1 + y_1, \ldots, x_n + y_n)$$

and if $\alpha \in \mathbb{R}$, then

$$\alpha x = (\alpha x_1, \ldots, \alpha x_n).$$


A.2 Elements of Set Theory 


Given a set $M$ its powerset $\mathcal{P}(M)$ consists of all subsets of $M$. In the case that $M$ is finite
we have $\#(\mathcal{P}(M)) = 2^{\#(M)}$, where $\#(A)$ denotes the cardinality (number of elements) of
a finite set $A$.

If $A$ and $B$ are subsets of $M$, written as $A, B \subseteq M$ or also as $A, B \in \mathcal{P}(M)$, their union
and their intersection are, as usual, defined by

$$A \cup B = \{x \in M : x \in A \text{ or } x \in B\} \quad\text{and}\quad A \cap B = \{x \in M : x \in A \text{ and } x \in B\}.$$

Of course, it always holds that

$$A \cap B \subseteq A \subseteq A \cup B \quad\text{and}\quad A \cap B \subseteq B \subseteq A \cup B.$$

In the same way, given subsets $A_1, A_2, \ldots$ of $M$ their union $\bigcup_{j=1}^{\infty} A_j$ and their intersection
$\bigcap_{j=1}^{\infty} A_j$ is the set of those $x \in M$ that belong to at least one of the $A_j$ or that belong
to all $A_j$, respectively.

Quite often we use the distributive law for intersection and union. This asserts

$$A \cap \Big(\bigcup_{j \ge 1} B_j\Big) = \bigcup_{j \ge 1} (A \cap B_j).$$

Two sets $A$ and $B$ are said to be disjoint¹ provided that $A \cap B = \emptyset$. A sequence of sets
$A_1, A_2, \ldots$ is called disjoint² whenever $A_i \cap A_j = \emptyset$ if $i \ne j$.

An element $x \in M$ belongs to the set difference $A \setminus B$ provided that $x \in A$ but $x \notin B$.
Using the notion of the complementary set $B^c := \{x \in M : x \notin B\}$, the set difference
may also be written as

$$A \setminus B = A \cap B^c.$$

Another useful identity is

$$A \setminus B = A \setminus (A \cap B).$$

Conversely, the complementary set may be represented as the set difference $B^c = M \setminus B$.
We still mention the obvious $(B^c)^c = B$.
Finally we introduce the symmetric difference $A \Delta B$ of two sets $A$ and $B$ as

$$A \Delta B := (A \setminus B) \cup (B \setminus A) = (A \cap B^c) \cup (B \cap A^c) = (A \cup B) \setminus (A \cap B).$$

Note that an element $x \in M$ belongs to $A \Delta B$ if and only if $x$ belongs to exactly one of
the sets $A$ or $B$.
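De Morgan’s rules are very important and assert the following:

$$\Big(\bigcup_{j \ge 1} A_j\Big)^c = \bigcap_{j \ge 1} A_j^c \quad\text{and}\quad \Big(\bigcap_{j \ge 1} A_j\Big)^c = \bigcup_{j \ge 1} A_j^c.$$
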
Given sets $A_1, \ldots, A_n$ their Cartesian product $A_1 \times \cdots \times A_n$ is defined by

$$A_1 \times \cdots \times A_n := \{(a_1, \ldots, a_n) : a_j \in A_j\}.$$

Note that $\#(A_1 \times \cdots \times A_n) = \#(A_1) \cdots \#(A_n)$.
Let $S$ be another set, for example, $S = \mathbb{R}$, and let $f : M \to S$ be some mapping from
$M$ to $S$. Given a subset $B \subseteq S$, we denote the preimage of $B$ with respect to $f$ by

$$f^{-1}(B) := \{x \in M : f(x) \in B\}. \tag{A.1}$$


1 Sometimes called “mutually exclusive.” 
2 More precisely, one should say “pairwise disjoint.” 
In other words, an element $x \in M$ belongs to $f^{-1}(B)$ if and only if its image with respect
to $f$ is an element of $B$.
We summarize some crucial properties of the preimage in a proposition.


Proposition A.2.1. Let $f : M \to S$ be a mapping from $M$ into another set $S$.
(1) $f^{-1}(\emptyset) = \emptyset$ and $f^{-1}(S) = M$.
(2) For any subsets $B_j \subseteq S$ the following equalities are valid:

$$f^{-1}\Big(\bigcup_{j \ge 1} B_j\Big) = \bigcup_{j \ge 1} f^{-1}(B_j) \quad\text{and}\quad f^{-1}\Big(\bigcap_{j \ge 1} B_j\Big) = \bigcap_{j \ge 1} f^{-1}(B_j). \tag{A.2}$$


Proof: We only prove the left-hand equality in eq. (A.2). The right-hand one is proved
by the same methods. Furthermore, assertion (1) follows immediately.

Take $x \in f^{-1}\big(\bigcup_{j \ge 1} B_j\big)$. This happens if and only if

$$f(x) \in \bigcup_{j \ge 1} B_j \tag{A.3}$$

is satisfied. But this is equivalent to the existence of a certain $j_0 \ge 1$ with $f(x) \in B_{j_0}$.
By definition of the preimage the last statement may be reformulated as follows: there
exists a $j_0 \ge 1$ such that $x \in f^{-1}(B_{j_0})$. But this is equivalent to

$$x \in \bigcup_{j \ge 1} f^{-1}(B_j). \tag{A.4}$$

Consequently, an element $x \in M$ satisfies condition (A.3) if and only if property (A.4)
holds. This proves the left-hand identity in formulas (A.2). □


A.3 Combinatorics 


A.3.1 Binomial Coefficients 


A one-to-one mapping $\pi$ from $\{1, \ldots, n\}$ to $\{1, \ldots, n\}$ is called a permutation (of
order $n$). Any permutation reorders the numbers from 1 to $n$ as $\pi(1), \pi(2), \ldots, \pi(n)$ and,
vice versa, each reordering of these numbers generates a permutation. One way to
write a permutation is

$$\pi = \begin{pmatrix} 1 & 2 & \ldots & n \\ \pi(1) & \pi(2) & \ldots & \pi(n) \end{pmatrix}.$$

For example, if $n = 3$, then $\pi = \begin{pmatrix} 1 & 2 & 3 \\ 2 & 3 & 1 \end{pmatrix}$ is equivalent to the order 2, 3, 1 or to

$$\pi(1) = 2, \quad \pi(2) = 3 \quad\text{and}\quad \pi(3) = 1.$$
Let $S_n$ be the set of all permutations of order $n$. Then one may ask for $\#(S_n)$ or,
equivalently, for the number of possible orderings of the numbers $\{1, \ldots, n\}$.
To treat this problem we need the following definition.


Definition A.3.1. For $n \in \mathbb{N}$ we define $n$-factorial by setting

$$n! = 1 \cdot 2 \cdots (n - 1) \cdot n.$$

Furthermore, let $0! = 1$.


Now we may answer the question about the cardinality of $S_n$.


Proposition A.3.2. We have

$$\#(S_n) = n! \tag{A.5}$$

or, equivalently, there are $n!$ different ways to order $n$ distinguishable objects.


Proof: The proof is done by induction over $n$. If $n = 1$ then $\#(S_1) = 1 = 1!$ and eq. (A.5)
is valid.

Now suppose that eq. (A.5) is true for $n$. In order to prove eq. (A.5) for $n + 1$ we split
$S_{n+1}$ as follows:

$$S_{n+1} = \bigcup_{k=1}^{n+1} A_k,$$

where

$$A_k = \{\pi \in S_{n+1} : \pi(n + 1) = k\}, \quad k = 1, \ldots, n + 1.$$

Each $\pi \in A_k$ generates a one-to-one mapping $\tilde{\pi}$ from $\{1, \ldots, n\}$ onto the set
$\{1, \ldots, k - 1, k + 1, \ldots, n + 1\}$ by letting $\tilde{\pi}(j) = \pi(j)$, $1 \le j \le n$. Vice versa, each such $\tilde{\pi}$ defines
a permutation $\pi \in A_k$ by setting $\pi(j) = \tilde{\pi}(j)$, $j \le n$, and $\pi(n + 1) = k$. Consequently, since
eq. (A.5) holds for $n$ we get $\#(A_k) = n!$. Furthermore, the $A_k$s are disjoint, and

$$\#(S_{n+1}) = \sum_{k=1}^{n+1} \#(A_k) = (n + 1) \cdot n! = (n + 1)!,$$

hence eq. (A.5) also holds for $n + 1$. This completes the proof. □
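
For small $n$ the statement $\#(S_n) = n!$ and the splitting used in the induction step can be checked by brute force. The sketch below lists all permutations with Python's `itertools`; the helper name `count_permutations` and the choice $n = 3$ are illustrative.

```python
from itertools import permutations
from math import factorial


def count_permutations(n):
    """Count the elements of S_n by listing every ordering of 1, ..., n."""
    return sum(1 for _ in permutations(range(1, n + 1)))


for n in range(1, 7):
    assert count_permutations(n) == factorial(n)

# Splitting from the proof: A_k collects the permutations with pi(n + 1) = k.
n = 3
s_next = list(permutations(range(1, n + 2)))  # all elements of S_{n+1}
class_sizes = [sum(1 for p in s_next if p[-1] == k) for k in range(1, n + 2)]
print(class_sizes)  # [6, 6, 6, 6]: n + 1 classes, each with n! elements
```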


Next we treat a tightly related problem. Say we have n different objects and we want to 
distribute them into two disjoint groups, one having k elements, the other n—k. Hereby 
it is of no interest in which order the elements are distributed, only the composition of 
the two sets matters. 
Example A.3.3. There are 52 cards in a deck that are distributed to two players, so that
each of them gets 26 cards. For this game it is only important which cards each player
has, not in which order the cards were received. Here $n = 52$ and $k = n - k = 26$.


The main question is: how many ways can $n$ elements be distributed, say the numbers
from 1 to $n$, into one group of $k$ elements and into another of $n - k$ elements? In
the above example, that is how many ways can 52 cards be distributed into two
groups of 26.

To answer this question we use the following auxiliary model. Let us take any
permutation $\pi \in S_n$. We place the numbers $\pi(1), \ldots, \pi(k)$ into group 1 and the remaining
$\pi(k + 1), \ldots, \pi(n)$ into group 2. In this way we obtain all possible distributions but
many of them appear several times. Say two permutations $\pi_1$ and $\pi_2$ are equivalent if
(as sets)

$$\{\pi_1(1), \ldots, \pi_1(k)\} = \{\pi_2(1), \ldots, \pi_2(k)\}.$$

Of course, this also implies

$$\{\pi_1(k + 1), \ldots, \pi_1(n)\} = \{\pi_2(k + 1), \ldots, \pi_2(n)\},$$

and two permutations generate the same partition if and only if they are equivalent.
Equivalent permutations are achieved by taking one fixed permutation $\pi$, then permuting
$\{\pi(1), \ldots, \pi(k)\}$ and also $\{\pi(k + 1), \ldots, \pi(n)\}$. Consequently, there are exactly
$k!\,(n - k)!$ permutations that are equivalent to a given one. Summing up, we get that
there are $\frac{n!}{k!\,(n - k)!}$ different classes of equivalent permutations. Setting

$$\binom{n}{k} := \frac{n!}{k!\,(n - k)!}$$

we see the following.


There are $\binom{n}{k}$ different ways to distribute $n$ objects into one group of $k$ and into another
one of $n - k$ elements.


The numbers $\binom{n}{k}$ are called binomial coefficients, read “n chosen k.” We let $\binom{n}{k} = 0$
in case of $k > n$ or $k < 0$.


Example A.3.4. A digital word of length $n$ consists of $n$ zeroes or ones. Since at every
position we may have either 0 or 1, there are $2^n$ different words of length $n$. How many
of these words possess exactly $k$ ones or, equivalently, $n - k$ zeroes? To answer this put
all positions where there is a “1” into a first group and those where there is a “0” into a
second one. In this way the numbers from 1 to $n$ are divided into two different groups
of size $k$ and $n - k$, respectively. But we already know how many such partitions exist,
namely $\binom{n}{k}$. As a consequence we get


There are $\binom{n}{k}$ words of length $n$ possessing exactly $k$ ones and $n - k$ zeroes.
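
A brute-force count over all $2^n$ words confirms this for small $n$. The sketch below uses $n = 8$ as an arbitrary illustrative length; the function name is not from the book.

```python
from itertools import product
from math import comb


def words_with_k_ones(n, k):
    """Count the 0/1 words of length n that contain exactly k ones."""
    return sum(1 for word in product((0, 1), repeat=n) if sum(word) == k)


n = 8
counts = [words_with_k_ones(n, k) for k in range(n + 1)]
print(counts)  # [1, 8, 28, 56, 70, 56, 28, 8, 1]
```

Summing the counts over $k = 0, \ldots, n$ recovers the total number $2^n$ of words.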


The next proposition summarizes some crucial properties of binomial coefficients. 


Proposition A.3.5. Let $n$ be a natural number, $k = 0, \ldots, n$ and let $r \ge 0$ be an integer.
Then the following equations hold:

$$\binom{n}{k} = \binom{n}{n - k} \tag{A.6}$$

$$\binom{n}{k} = \binom{n - 1}{k} + \binom{n - 1}{k - 1} \quad\text{and} \tag{A.7}$$

$$\binom{n + r}{n} = \sum_{j=0}^{n} \binom{n + r - j - 1}{n - j} = \sum_{j=0}^{n} \binom{r + j - 1}{j}. \tag{A.8}$$


Proof: Equations (A.6) and (A.7) follow immediately by the definition of the binomial
coefficients. Note that eq. (A.7) also holds if $k = n$ because we agreed that $\binom{n-1}{n} = 0$.
An iteration of eq. (A.7) leads to

$$\binom{n}{k} = \sum_{j=0}^{k} \binom{n - j - 1}{k - j}.$$

Replacing in the last equation $n$ by $n + r$ as well as $k$ by $n$ we obtain the left-hand
identity (A.8). The right-hand equation follows by inverting the summation, that is,
one replaces $j$ by $n - j$. □


Remark A.3.6. Equation (A.7) allows a graphical interpretation by Pascal’s triangle.
The coefficient $\binom{n}{k}$ in the $n$th row follows by summing the two values $\binom{n-1}{k-1}$ and $\binom{n-1}{k}$
above $\binom{n}{k}$ in the $(n - 1)$th row.
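
Pascal's triangle can be generated row by row from the recursion $\binom{n}{k} = \binom{n-1}{k} + \binom{n-1}{k-1}$ alone. The sketch below does so and compares the result with Python's built-in `math.comb`; it also checks the symmetry $\binom{n}{k} = \binom{n}{n-k}$ and the summation identity $\binom{n+r}{n} = \sum_{j=0}^{n} \binom{r+j-1}{j}$ for $r \ge 1$. The function name `pascal_rows` and the ranges tested are illustrative assumptions.

```python
from math import comb


def pascal_rows(n_max):
    """Rows 0, ..., n_max of Pascal's triangle, built only from the recursion."""
    rows = [[1]]
    for n in range(1, n_max + 1):
        prev = rows[-1]
        # Out-of-range neighbours count as 0, matching binom(n-1, n) = binom(n-1, -1) = 0.
        row = [(prev[k] if k < n else 0) + (prev[k - 1] if k >= 1 else 0)
               for k in range(n + 1)]
        rows.append(row)
    return rows


rows = pascal_rows(10)
for n, row in enumerate(rows):
    assert row == [comb(n, k) for k in range(n + 1)]  # agrees with n! / (k! (n-k)!)
    assert row == row[::-1]                           # symmetry

for n in range(0, 8):
    for r in range(1, 8):
        assert comb(n + r, n) == sum(comb(r + j - 1, j) for j in range(n + 1))

print(rows[4])  # [1, 4, 6, 4, 1]
```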


Next we state and prove the important binomial theorem. 
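Proposition A.3.7 (Binomial theorem). For real numbers $a$ and $b$ and any $n \in \mathbb{N}_0$,

$$(a + b)^n = \sum_{k=0}^{n} \binom{n}{k} a^k b^{n-k}. \tag{A.9}$$


Proof: The binomial theorem is proved by induction over $n$. If $n = 0$, then eq. (A.9)
holds trivially.

Suppose now that eq. (A.9) has been proven for $n - 1$. Our aim is to verify that it is
also true for $n$. Using that the expansion holds for $n - 1$ we get

$$(a + b)^n = (a + b)^{n-1}(a + b)$$
$$= \sum_{k=0}^{n-1} \binom{n-1}{k} a^{k+1} b^{n-1-k} + \sum_{k=0}^{n-1} \binom{n-1}{k} a^k b^{n-k}$$
$$= a^n + \sum_{k=0}^{n-2} \binom{n-1}{k} a^{k+1} b^{n-1-k} + b^n + \sum_{k=1}^{n-1} \binom{n-1}{k} a^k b^{n-k}$$
$$= a^n + b^n + \sum_{k=1}^{n-1} \Big[\binom{n-1}{k-1} + \binom{n-1}{k}\Big] a^k b^{n-k} = \sum_{k=0}^{n} \binom{n}{k} a^k b^{n-k},$$

where we used eq. (A.7) in the last step. □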


The following property of binomial coefficients plays an important role when in- 
troducing the hypergeometric distribution (compare Proposition 1.4.24). It is also 
used during the investigation of sums of independent binomial distributed random 
variables (compare Proposition 4.6.1). 


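Proposition A.3.8 (Vandermonde’s identity). If $k$, $m$, and $n$ are in $\mathbb{N}_0$, then

$$\sum_{j=0}^{k} \binom{n}{j} \binom{m}{k - j} = \binom{n + m}{k}. \tag{A.10}$$


Proof: An application of the binomial theorem leads to

$$(1 + x)^{n+m} = \sum_{k=0}^{n+m} \binom{n + m}{k} x^k, \quad x \in \mathbb{R}. \tag{A.11}$$

On the other hand, another use of Proposition A.3.7 implies³

$$(1 + x)^{n+m} = (1 + x)^n (1 + x)^m$$
$$= \Big[\sum_{j=0}^{n} \binom{n}{j} x^j\Big] \Big[\sum_{i=0}^{m} \binom{m}{i} x^i\Big] = \sum_{j=0}^{n} \sum_{i=0}^{m} \binom{n}{j} \binom{m}{i} x^{i+j}$$
$$= \sum_{k=0}^{n+m} \Big[\sum_{i+j=k} \binom{n}{j} \binom{m}{i}\Big] x^k = \sum_{k=0}^{n+m} \Big[\sum_{j=0}^{k} \binom{n}{j} \binom{m}{k - j}\Big] x^k. \tag{A.12}$$

The coefficients in an expansion of a polynomial are unique. Hence, in view of eqs.
(A.11) and (A.12), we get for all $k \le m + n$ the identity

$$\binom{n + m}{k} = \sum_{j=0}^{k} \binom{n}{j} \binom{m}{k - j}.$$

Hereby note that both sides of eq. (A.10) become zero whenever $k > n + m$. This
completes the proof. □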
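
Vandermonde's identity is easy to test numerically, and the proof idea (comparing coefficients of a product of two polynomials) can be mirrored by multiplying coefficient lists. The sketch below does both for small $n$ and $m$; the helper names and the tested ranges are illustrative. Note that `math.comb(n, j)` returns 0 for $j > n$, which matches the convention used in the text.

```python
from math import comb


def vandermonde_lhs(n, m, k):
    """Left-hand side of (A.10): sum over j of binom(n, j) * binom(m, k - j)."""
    return sum(comb(n, j) * comb(m, k - j) for j in range(k + 1))


def poly_mul(p, q):
    """Multiply two polynomials given as coefficient lists (index = power of x)."""
    out = [0] * (len(p) + len(q) - 1)
    for j, pj in enumerate(p):
        for i, qi in enumerate(q):
            out[i + j] += pj * qi  # collects all terms with i + j = k
    return out


for n in range(6):
    for m in range(6):
        for k in range(n + m + 3):  # includes k > n + m, where both sides are 0
            assert vandermonde_lhs(n, m, k) == comb(n + m, k)

n, m = 3, 4
product_coeffs = poly_mul([comb(n, j) for j in range(n + 1)],
                          [comb(m, i) for i in range(m + 1)])
print(product_coeffs)  # coefficients of (1 + x)^7
```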


Our next objective is to generalize the binomial coefficients. In view of

$$\binom{n}{k} = \frac{n(n - 1) \cdots (n - k + 1)}{k!}$$

for $k \ge 1$ and $n \in \mathbb{N}$ the generalized binomial coefficient is introduced as

$$\binom{-n}{k} := \frac{-n(-n - 1) \cdots (-n - k + 1)}{k!}. \tag{A.13}$$

The next lemma shows the tight relation between generalized and “ordinary” binomial
coefficients.


Lemma A.3.9. For $k \ge 1$ and $n \in \mathbb{N}$,

$$\binom{-n}{k} = (-1)^k \binom{n + k - 1}{k}.$$


3 When passing from line 2 to line 3 the order of summation is changed. One no longer sums over the
rectangle $[0, m] \times [0, n]$. Instead one sums along the diagonals, where $i + j = k$.

Proof: By definition of the generalized binomial coefficient we obtain

$$\binom{-n}{k} = \frac{(-n)(-n - 1) \cdots (-n - k + 1)}{k!}
= (-1)^k \frac{(n + k - 1)(n + k - 2) \cdots (n + 1)\, n}{k!} = (-1)^k \binom{n + k - 1}{k}.$$

This completes the proof. □


For example, Lemma A.3.9 implies $\binom{-1}{k} = (-1)^k$ and $\binom{-n}{1} = -n$.
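
The generalized binomial coefficient is just a finite product divided by $k!$, so Lemma A.3.9 can be verified directly. The following sketch uses exact rational arithmetic; the function name `gen_binom` and the tested ranges are illustrative.

```python
from fractions import Fraction
from math import comb, factorial


def gen_binom(a, k):
    """a (a - 1) ... (a - k + 1) / k!  for an integer a (possibly negative), k >= 1."""
    numerator = 1
    for i in range(k):
        numerator *= a - i
    return Fraction(numerator, factorial(k))


for n in range(1, 8):
    for k in range(1, 8):
        assert gen_binom(-n, k) == (-1) ** k * comb(n + k - 1, k)  # Lemma A.3.9
        assert gen_binom(n, k) == comb(n, k)                       # ordinary case

print(gen_binom(-1, 5), gen_binom(-4, 1))  # -1 -4
```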


A.3.2 Drawing Balls out of an Urn 


Assume that there are n balls labeled from 1 to n in an urn. We draw k balls out of 
the urn, thus observing a sequence of length k with entries from {1,...,n}. How many 
different results (sequences) may be observed? To answer this question we have to 
decide the arrangement of drawing. Do we or do we not replace the chosen ball? Is 
it important in which order the balls were chosen or is it only of importance which 
balls were chosen at all? Thus, we see that there are four different ways to answer this 
question (replacement or nonreplacement, recording the order or nonrecording). 


Example A.3.10. Let us regard the drawing of two balls out of four, that is, $n = 4$
and $k = 2$. Depending on the different arrangements the following results may be
observed. Note, for example, that in the two latter cases (3, 2) does not appear because
it is identical to (2, 3).

Replacement and order is important:
(1,1) (1,2) (1,3) (1,4)
(2,1) (2,2) (2,3) (2,4)
(3,1) (3,2) (3,3) (3,4)
(4,1) (4,2) (4,3) (4,4)
16 different results

Nonreplacement and order is important:
  ·   (1,2) (1,3) (1,4)
(2,1)   ·   (2,3) (2,4)
(3,1) (3,2)   ·   (3,4)
(4,1) (4,2) (4,3)   ·
12 different results

Replacement and order is not important:
(1,1) (1,2) (1,3) (1,4)
  ·   (2,2) (2,3) (2,4)
  ·     ·   (3,3) (3,4)
  ·     ·     ·   (4,4)
10 different results

Nonreplacement and order is not important:
  ·   (1,2) (1,3) (1,4)
  ·     ·   (2,3) (2,4)
  ·     ·     ·   (3,4)
  ·     ·     ·     ·
6 different results
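
The four arrangements correspond to four standard enumeration tools in Python's `itertools` module, so the counts for $n = 4$ and $k = 2$ can be reproduced mechanically. The dictionary keys below are illustrative labels.

```python
from itertools import (combinations, combinations_with_replacement,
                       permutations, product)

balls = range(1, 5)  # n = 4 balls labeled 1, ..., 4
k = 2

results = {
    "replacement, order recorded": list(product(balls, repeat=k)),
    "no replacement, order recorded": list(permutations(balls, k)),
    "replacement, order ignored": list(combinations_with_replacement(balls, k)),
    "no replacement, order ignored": list(combinations(balls, k)),
}

for name, outcomes in results.items():
    print(f"{name}: {len(outcomes)} results")
# replacement, order recorded: 16 results
# no replacement, order recorded: 12 results
# replacement, order ignored: 10 results
# no replacement, order ignored: 6 results
```

When the order is ignored, `itertools` reports each result in increasing order, so (2, 3) appears but (3, 2) does not.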
Let us come back now to the general situation of $n$ different balls from which we
choose $k$ at random.

Case 1: Drawing with replacement and taking the order into account.

We have $n$ different possibilities for the choice of the first ball and since the chosen
ball is placed back there are also $n$ possibilities for the second one and so on. Thus,
there are $n$ possibilities for each of the $k$ balls, leading to the following result.


The number of different results in this case is $n^k$.


Example A.3.11. Letters in Braille, a script for blind people, are generated by dots
or nondots at six different positions. How many letters may be generated in that way?
Answer: It holds that $n = 2$ (dot or no dot) at $k = 6$ different positions. Hence,
the number of possible representable letters is $2^6 = 64$. In fact, there are only 63
possibilities because we have to rule out the case of no dots at all 6 positions.


Case 2: Drawing without replacement and taking the order into account.

This case only makes sense if $k \le n$. There are $n$ possibilities to choose the first ball.
After that there are still $n - 1$ balls in the urn. Hence there are only $n - 1$ possibilities
for the second choice, $n - 2$ for the third, and so on. Summing up we get the following.


The number of possible results in this case equals

$$n(n - 1) \cdots (n - k + 1) = \frac{n!}{(n - k)!}.$$


Example A.3.12. In a lottery 6 numbers are chosen out of 49. Of course, the chosen
numbers are not replaced. If we record the numbers as they appear (not putting them
in order) how many different sequences of six numbers exist?

Answer: Here we have $n = 49$ and $k = 6$. Hence the wanted number equals

$$\frac{49!}{43!} = 49 \cdots 44 = 10{,}068{,}347{,}520.$$


Case 3: Drawing with replacement not taking the order into account.

This case is more complicated and requires a different point of view. We count how
often each of the $n$ balls was chosen during the $k$ trials. Let $k_1 \ge 0$ be the frequency
of the first ball, $k_2 \ge 0$ that of the second one, and so on. In this way we obtain $n$
non-negative integers $k_1, \ldots, k_n$ satisfying

$$k_1 + \cdots + k_n = k.$$

Indeed, since we choose $k$ balls, the frequencies have to sum to $k$. Consequently, the
number of possible results when drawing $k$ of $n$ balls with replacement and not taking
the order into account coincides with

$$\#\{(k_1, \ldots, k_n) : k_j \in \mathbb{N}_0,\ k_1 + \cdots + k_n = k\}. \tag{A.14}$$

In order to determine the cardinality (A.14) we use the following auxiliary model:

Let $B_1, \ldots, B_n$ be $n$ boxes. Given $n$ nonnegative integers $k_1, \ldots, k_n$, summing to $k$,
we place exactly $k_1$ dots into $B_1$, $k_2$ dots into $B_2$, and so on. At the end we have distributed
$k$ nondistinguishable dots into $n$ different boxes. Thus, we see that the value of (A.14)
coincides with the number of different possibilities to distribute $k$ nondistinguishable
dots into $n$ boxes. Now assume that the boxes are glued together; on the very left we
put box $B_1$, on its right we put box $B_2$ and continue in this way up to box $B_n$ on the very
right. In this way we obtain $n + 1$ dividing walls, two outer and $n - 1$ inner ones. Now
we get all possible distributions of $k$ dots into $n$ boxes by shuffling the $k$ dots and the
$n - 1$ inner dividing walls. For example, if we get the order $w, w, d, d, w, \ldots$, then this
means that there are no dots in $B_1$ and $B_2$, but there are two dots in $B_3$.

Summing up, we have $N = n + k - 1$ objects, $k$ of them are dots and $n - 1$ are walls.
As we know there are $\binom{N}{k}$ different ways to order these $N$ objects. Hence we arrived at
the following result.


The number of possibilities to distribute $k$ anonymous dots into $n$ boxes equals

$$\binom{n + k - 1}{k} = \binom{n + k - 1}{n - 1}.$$

It coincides with $\#\{(k_1, \ldots, k_n) : k_j \in \mathbb{N}_0,\ k_1 + \cdots + k_n = k\}$ as well as with the number of
different results when choosing $k$ balls out of $n$ with replacement and not taking order
into account.


Example A.3.13. Dominoes are marked on each half either with no dots, one dot or
up to six dots. Hereby the dominoes are symmetric, that is, a tile with three dots on
the left-hand side and two on the right-hand one is identical with one having two
dots on the left-hand side and three dots on the right-hand one. How many different
dominoes exist?

Answer: It holds $n = 7$ and $k = 2$, hence the number of different dominoes equals

$$\binom{7 + 2 - 1}{2} = \binom{8}{2} = 28.$$


Case 4: Drawing without replacement not taking the order into account.

Here we also have to assume $k \le n$. We already investigated this case when we
introduced the binomial coefficients. The $k$ chosen numbers are put in group 1, the
remaining $n - k$ balls in group 2. As we know there are $\binom{n}{k}$ ways to split the $n$ numbers
into such two groups. Hence we obtained the following.


The number of different results in this case is $\binom{n}{k}$.


Example A.3.14. If the order of the six numbers is not taken into account in Example
A.3.12, that is, we ignore which number was chosen first, which second, and so on, the
number of possible results equals

$$\binom{49}{6} = \frac{49 \cdots 44}{6!} = 13{,}983{,}816.$$


Let us summarize the four different cases in a table. Here O and NO stand for recording
or nonrecording of the order while R and NR represent replacement or
nonreplacement.


        R                       NR
O       $n^k$                   $\frac{n!}{(n-k)!}$
NO      $\binom{n+k-1}{k}$      $\binom{n}{k}$
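The four cases correspond directly to four iterators in Python's `itertools` module. The sketch below compares the standard closed-form counts ($n^k$ and $n!/(n-k)!$ for the ordered cases, the two binomial coefficients for the unordered ones) with a brute-force enumeration. It is meant as an illustration only; the names `four_cases` and `four_cases_brute_force` are hypothetical.

```python
from itertools import (combinations, combinations_with_replacement,
                       permutations, product)
from math import comb, factorial


def four_cases(n: int, k: int) -> dict:
    """Closed-form counts; the NR cases require k <= n."""
    return {
        ("O", "R"): n ** k,
        ("O", "NR"): factorial(n) // factorial(n - k),
        ("NO", "R"): comb(n + k - 1, k),
        ("NO", "NR"): comb(n, k),
    }


def four_cases_brute_force(n: int, k: int) -> dict:
    """Count the outcomes by listing them (feasible only for small n and k)."""
    items = range(n)
    return {
        ("O", "R"): sum(1 for _ in product(items, repeat=k)),
        ("O", "NR"): sum(1 for _ in permutations(items, k)),
        ("NO", "R"): sum(1 for _ in combinations_with_replacement(items, k)),
        ("NO", "NR"): sum(1 for _ in combinations(items, k)),
    }


print(four_cases(5, 3))
print(four_cases_brute_force(5, 3))
```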


A.3.3 Multinomial Coefficients 


The binomial coefficient $\binom{n}{k}$ describes the number of possibilities to distribute $n$ objects
into two groups of $k$ and $n - k$ elements. What happens if we have not only two
groups but $m > 2$? Say the first group has $k_1$ elements, the second has $k_2$ elements, and
so on, up to the $m$th group that has $k_m$ elements. Of course, if we distribute $n$ elements
the $k_j$ have to satisfy

\[ k_1 + \cdots + k_m = n. \]

Using exactly the same arguments as in the case where $m = 2$ we get the following.


There exist exactly $\frac{n!}{k_1! \cdots k_m!}$ different ways to distribute $n$ elements into $m$ groups of
sizes $k_1, k_2, \ldots, k_m$ where $k_1 + \cdots + k_m = n$.


In accordance with the binomial coefficient we write

\[ \binom{n}{k_1, \ldots, k_m} := \frac{n!}{k_1! \cdots k_m!}, \qquad k_1 + \cdots + k_m = n, \]

and call $\binom{n}{k_1, \ldots, k_m}$ a multinomial coefficient, read "$n$ chosen $k_1$ up to $k_m$."

Remark A.3.15. If $m = 2$, then $k_1 + k_2 = n$, and

\[ \binom{n}{k_1, k_2} = \binom{n}{k_1, n - k_1} = \binom{n}{k_1} = \binom{n}{k_2}. \]


Example A.3.16. A deck of cards for playing skat consists of 32 cards. Each of the three
players gets 10 cards; the remaining two cards (called "skat") are placed on the table.
How many different distributions of the cards exist?

Answer: Let us first define what it means for two distributions of cards to be
identical. Say, this happens if each of the three players has exactly the same cards as in
the previous game. Therefore, the remaining two cards on the table are also identical.
Hence we distribute 32 cards into 4 groups possessing 10, 10, 10, and 2 elements.
Consequently, the number of different distributions equals⁴

\[ \binom{32}{10, 10, 10, 2} = \frac{32!}{(10!)^3 \, 2!} = 2.753294409 \times 10^{15}. \]
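For readers who want to reproduce such numbers, here is a minimal Python sketch of the multinomial coefficient using exact integer arithmetic. The function name `multinomial` is our own; the code simply evaluates $n!/(k_1! \cdots k_m!)$.

```python
from math import factorial


def multinomial(n: int, ks: list[int]) -> int:
    """Return n! / (k_1! * ... * k_m!) for nonnegative k_j with k_1 + ... + k_m = n."""
    if any(k < 0 for k in ks) or sum(ks) != n:
        raise ValueError("group sizes must be nonnegative and sum to n")
    result = factorial(n)
    for k in ks:
        result //= factorial(k)  # each intermediate quotient is an integer
    return result


# Number of skat deals: 32 cards into groups of sizes 10, 10, 10 and 2.
print(multinomial(32, [10, 10, 10, 2]))
```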


Remark A.3.17. One may also look at multinomial coefficients from a different point
of view. Suppose we are given $n$ balls of $m$ different colors. Say there are $k_1$ balls of
color 1, $k_2$ balls of color 2, up to $k_m$ balls of color $m$ where, of course, $k_1 + \cdots + k_m = n$.
Then there exist

\[ \binom{n}{k_1, \ldots, k_m} \]

different ways to order these $n$ balls. This follows by the same arguments as we
used in Example A.3.4 for $m = 2$.
For instance, given 3 blue, 4 red, and 2 white balls, there are

\[ \binom{9}{3, 4, 2} = \frac{9!}{3! \, 4! \, 2!} = 1260 \]

different ways to order them.

Finally, let us mention that in the literature one sometimes finds another
(equivalent) way to introduce the multinomial coefficients. Given nonnegative
integers $k_1, \ldots, k_m$ with $k_1 + \cdots + k_m = n$, it follows that

\[ \binom{n}{k_1, \ldots, k_m} = \binom{n}{k_1} \binom{n - k_1}{k_2} \binom{n - k_1 - k_2}{k_3} \cdots \binom{n - k_1 - \cdots - k_{m-1}}{k_m}. \tag{A.16} \]

A direct proof of this fact is easy and left as an exercise.


4 The huge size of this number explains why playing skat never becomes boring.

There is a combinatorial interpretation of the expression on the right-hand side of
eq. (A.16). To reorder $n$ balls of $m$ different colors, one chooses first the $k_1$ positions
for balls of color 1. There are $\binom{n}{k_1}$ ways to do this. Thus, there remain $n - k_1$ possible
positions for balls of color 2, and there are $\binom{n - k_1}{k_2}$ possible choices for this, and so on.
Note that at the end there remain $k_m$ positions for $k_m$ balls; hence, the last term on the
right-hand side of eq. (A.16) equals 1.
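The color-by-color interpretation of eq. (A.16) translates into a short loop. The following Python sketch is an illustration under our own naming (`multinomial_via_binomials`, `count_orderings`); it evaluates the right-hand side of (A.16) and, for a small example, counts the distinguishable orderings by listing all of them.

```python
from itertools import permutations
from math import comb, factorial


def multinomial(n, ks):
    """Left-hand side of (A.16): n! / (k_1! * ... * k_m!)."""
    result = factorial(n)
    for k in ks:
        result //= factorial(k)
    return result


def multinomial_via_binomials(n, ks):
    """Right-hand side of (A.16): choose the positions color by color."""
    result, remaining = 1, n
    for k in ks:
        result *= comb(remaining, k)  # positions for the current color
        remaining -= k                # positions left for the other colors
    return result


def count_orderings(balls: str) -> int:
    """Number of distinguishable orderings of the letters in `balls`."""
    return len(set(permutations(balls)))


print(multinomial(9, [3, 4, 2]), multinomial_via_binomials(9, [3, 4, 2]))
print(count_orderings("BBBRRRRWW"))  # 3 blue, 4 red, 2 white
```

Listing all $9!$ permutations is wasteful and only serves as an independent check of the formula.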
Let us come now to the announced generalization of Proposition A.3.7. 


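Proposition A.3.18 (Multinomial theorem). Let $n \ge 0$. Then for any $m \ge 1$ and real
numbers $x_1, \ldots, x_m$,

\[ (x_1 + \cdots + x_m)^n = \sum_{\substack{k_1 + \cdots + k_m = n \\ k_i \ge 0}} \binom{n}{k_1, \ldots, k_m} x_1^{k_1} \cdots x_m^{k_m}. \tag{A.17} \]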

Proof: Equality (A.17) is proved by induction. In contrast to the proof of the binomial
theorem, now induction is done over $m$, the number of summands.

If $m = 1$ the assertion is valid by trivial reasons.

Suppose now eq. (A.17) holds for $m$, all $n \ge 1$ and all real numbers $x_1, \ldots, x_m$. We
have to show the validity of eq. (A.17) for $m + 1$ and all $n \ge 1$. Given real numbers
$x_1, \ldots, x_{m+1}$ and $n \ge 1$ set $y := x_1 + \cdots + x_m$. Using Proposition A.3.7, by the validity of eq. (A.17) for
$m$ and all $n - j$, $0 \le j \le n$, we obtain

\[ (x_1 + \cdots + x_{m+1})^n = (y + x_{m+1})^n = \sum_{j=0}^{n} \frac{n!}{j! \, (n-j)!} \, x_{m+1}^{j} \, y^{n-j} \]

\[ = \sum_{j=0}^{n} \frac{n!}{j! \, (n-j)!} \sum_{\substack{k_1 + \cdots + k_m = n - j \\ k_i \ge 0}} \frac{(n-j)!}{k_1! \cdots k_m!} \, x_1^{k_1} \cdots x_m^{k_m} \, x_{m+1}^{j}. \]

Replacing $j$ by $k_{m+1}$ and combining both sums leads to

\[ (x_1 + \cdots + x_{m+1})^n = \sum_{\substack{k_1 + \cdots + k_{m+1} = n \\ k_i \ge 0}} \frac{n!}{k_1! \cdots k_{m+1}!} \, x_1^{k_1} \cdots x_{m+1}^{k_{m+1}}, \]

hence eq. (A.17) is also valid for $m + 1$. This completes the proof. □


Remark A.3.19. The number of summands in eq. (A.17) equals⁵ $\binom{n + m - 1}{n}$.


5 Compare case 3 in Section A.3.2.
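Both the multinomial theorem and the count of its summands can be checked numerically for small $n$ and $m$. The Python sketch below does this with exact integer arithmetic; the helper names `compositions` and `multinomial_sum` are our own and the enumeration is deliberately naive.

```python
from itertools import product
from math import comb, factorial, prod


def compositions(n: int, m: int) -> list:
    """All tuples (k_1, ..., k_m) of nonnegative integers with sum n."""
    return [ks for ks in product(range(n + 1), repeat=m) if sum(ks) == n]


def multinomial_sum(xs: list, n: int) -> int:
    """Right-hand side of the multinomial theorem for integer x_1, ..., x_m."""
    total = 0
    for ks in compositions(n, len(xs)):
        coeff = factorial(n)
        for k in ks:
            coeff //= factorial(k)
        total += coeff * prod(x ** k for x, k in zip(xs, ks))
    return total


xs, n = [2, -3, 5], 4
print(multinomial_sum(xs, n), sum(xs) ** n)                       # both sides
print(len(compositions(n, len(xs))), comb(n + len(xs) - 1, n))    # number of summands
```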




A.4 Vectors and Matrices 


The aim of this section is to summarize results and notations about vectors and
matrices used throughout this book. For more detailed reading we refer to any book
about Linear Algebra, for example, [Axl15].

Given two vectors $x$ and $y$ in $\mathbb{R}^n$, their⁶ scalar product is defined as

\[ \langle x, y \rangle := \sum_{j=1}^{n} x_j y_j, \qquad x = (x_1, \ldots, x_n),\ y = (y_1, \ldots, y_n). \]

If $x \in \mathbb{R}^n$, then

\[ |x| := \langle x, x \rangle^{1/2} = \Big( \sum_{j=1}^{n} x_j^2 \Big)^{1/2} \]

denotes the Euclidean distance of $x$ to 0. Thus, $|x|$ may also be regarded as the length
of the vector $x$. In particular, we have $|x| > 0$ for all nonzero $x \in \mathbb{R}^n$.

Any matrix $A = (a_{ij})_{i,j=1}^{n}$ of real numbers $a_{ij}$ generates a linear⁷ mapping (also
denoted by $A$) via

\[ Ax = \Big( \sum_{j=1}^{n} a_{1j} x_j, \ldots, \sum_{j=1}^{n} a_{nj} x_j \Big), \qquad x = (x_1, \ldots, x_n) \in \mathbb{R}^n. \tag{A.18} \]

Conversely, any linear mapping $A : \mathbb{R}^n \to \mathbb{R}^n$ defines a matrix $(a_{ij})_{i,j=1}^{n}$ by representing
$Ae_j \in \mathbb{R}^n$ as

\[ Ae_j = (a_{1j}, \ldots, a_{nj}), \qquad j = 1, \ldots, n. \]

Here $e_j = (0, \ldots, 0, 1, 0, \ldots, 0)$, with the entry 1 at position $j$, denotes the $j$th unit vector in $\mathbb{R}^n$. With this generated
matrix $(a_{ij})_{i,j=1}^{n}$ the linear mapping $A$ acts as stated in eq. (A.18). Consequently, we may
always identify linear mappings in $\mathbb{R}^n$ with $n \times n$ matrices $(a_{ij})_{i,j=1}^{n}$.
A matrix $A$ is said to be regular⁸ if the generated linear mapping is one-to-one,
that is, if $Ax = 0$ implies $x = 0$. This is equivalent to the fact that the determinant
$\det(A)$ is nonzero.


6 Sometimes also called "dot-product".
7 A mapping $A : \mathbb{R}^n \to \mathbb{R}^n$ is said to be linear if $A(\alpha x + \beta y) = \alpha Ax + \beta Ay$ for all $\alpha, \beta \in \mathbb{R}$ and $x, y \in \mathbb{R}^n$.
8 Sometimes also called nonsingular or invertible.

Let $A = (a_{ij})_{i,j=1}^{n}$ be an $n \times n$ matrix. Then its transposed matrix is defined as
$A^T := (a_{ji})_{i,j=1}^{n}$. With this notation it follows for $x, y \in \mathbb{R}^n$ that

\[ \langle Ax, y \rangle = \langle x, A^T y \rangle. \]

A matrix $A$ with $A = A^T$ is said to be symmetric. In other words, $A$ satisfies

\[ \langle Ax, y \rangle = \langle x, Ay \rangle, \qquad x, y \in \mathbb{R}^n. \]

An $n \times n$ matrix $R = (r_{ij})_{i,j=1}^{n}$ is positive definite (or shorter, positive) provided it is
symmetric and that

\[ \langle Rx, x \rangle = \sum_{i,j=1}^{n} r_{ij} x_i x_j > 0, \qquad x = (x_1, \ldots, x_n) \ne 0. \]

We will write $R > 0$ in this case. In particular, each positive matrix $R$ is regular and its
determinant satisfies $\det(R) > 0$.
Let $A = (a_{ij})_{i,j=1}^{n}$ be an arbitrary regular $n \times n$ matrix. Set

\[ R := AA^T, \tag{A.19} \]

that is, the entries $r_{ij}$ of $R$ are computed by

\[ r_{ij} = \sum_{k=1}^{n} a_{ik} a_{jk}, \qquad 1 \le i, j \le n. \]

Proposition A.4.1. Suppose the matrix $R$ is defined by eq. (A.19) for some regular $A$.
Then it follows that $R > 0$.


Proof: Because of

\[ R^T = (AA^T)^T = (A^T)^T A^T = AA^T = R, \]

the matrix $R$ is symmetric. Furthermore, for $x \in \mathbb{R}^n$ with $x \ne 0$ we obtain

\[ \langle Rx, x \rangle = \langle AA^T x, x \rangle = \langle A^T x, A^T x \rangle = |A^T x|^2 > 0. \]

Hereby we used that for a regular $A$ the transposed matrix $A^T$ is regular
too. Consequently, if $x \ne 0$, then $A^T x \ne 0$, thus $|A^T x| > 0$. This completes the
proof. □

The identity matrix $I_n$ is defined as the $n \times n$ matrix with entries $\delta_{ij}$, $1 \le i, j \le n$, where

\[ \delta_{ij} = \begin{cases} 1 & : \ i = j \\ 0 & : \ i \ne j. \end{cases} \tag{A.20} \]

Of course, $I_n x = x$ for $x \in \mathbb{R}^n$.

Given a regular $n \times n$ matrix $A$, there is a unique matrix $B$ such that $AB = I_n$. $B$ is
called the inverse matrix of $A$ and denoted by $A^{-1}$. Recall that also $A^{-1}A = I_n$ and,
moreover, $(A^T)^{-1} = (A^{-1})^T$.

An $n \times n$ matrix $U$ is said to be unitary or orthogonal provided that

\[ UU^T = U^T U = I_n \]

with identity matrix $I_n$. Another way to express this is either that $U^T = U^{-1}$ or,
equivalently, that $U$ satisfies

\[ \langle Ux, Uy \rangle = \langle x, y \rangle, \qquad x, y \in \mathbb{R}^n. \]

In particular, for each $x \in \mathbb{R}^n$ it follows

\[ |Ux|^2 = \langle Ux, Ux \rangle = \langle x, x \rangle = |x|^2, \]

that is, $U$ preserves the length of vectors in $\mathbb{R}^n$.

It is easy to see that an $n \times n$ matrix $U$ is unitary if and only if its column vectors
$u_1, \ldots, u_n$ form an orthonormal basis in $\mathbb{R}^n$. That is, $\langle u_i, u_j \rangle = \delta_{ij}$ with $\delta_{ij}$ as
in (A.20). This characterization of unitary matrices remains valid when we take the
row vectors instead of the column vectors.
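A plane rotation is the standard example of a unitary (orthogonal) matrix. The following Python sketch, written with plain lists so that no external library is needed, checks $UU^T = I_2$ and the preservation of length for one such matrix. All helper names (`mat_vec`, `mat_mul`, `transpose`, `norm`, `rotation`) are our own.

```python
import math


def transpose(M):
    return [list(col) for col in zip(*M)]


def mat_mul(A, B):
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]


def mat_vec(M, x):
    return [sum(m * xj for m, xj in zip(row, x)) for row in M]


def norm(x):
    """Euclidean length |x|."""
    return math.sqrt(sum(xj * xj for xj in x))


def rotation(theta):
    """Rotation of the plane by the angle theta; its columns are orthonormal."""
    c, s = math.cos(theta), math.sin(theta)
    return [[c, -s], [s, c]]


U = rotation(0.7)
x = [3.0, -4.0]
print(mat_mul(U, transpose(U)))     # close to the identity matrix
print(norm(x), norm(mat_vec(U, x))) # both lengths agree up to rounding
```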

We saw in Proposition A.4.1 that each matrix R of the form (A.19) is positive. Next 
we prove that conversely, each R > 0 may be represented in this way. 


Proposition A.4.2. Let $R$ be an arbitrary positive $n \times n$ matrix. Then there exists a regular
matrix $A$ such that $R = AA^T$.


Proof: Since $R$ is symmetric, we may apply the principal axis transformation for symmetric
matrices. It asserts that there exists a diagonal matrix⁹ $D$ and a unitary matrix
$U$ such that

\[ R = UDU^T. \]

Let $\delta_1, \ldots, \delta_n$ be the entries of $D$ at its diagonal. From $R > 0$ we derive $\delta_j > 0$, $1 \le j \le n$.
To see this fix $j \le n$ and set $x := Ue_j$ where as above $e_j$ denotes the $j$th unit vector in $\mathbb{R}^n$.


9 The entries $d_{ij}$ of $D$ satisfy $d_{ij} = 0$ if $i \ne j$.

Then $U^T x = e_j$, hence

\[ 0 < \langle Rx, x \rangle = \langle UDU^T x, x \rangle = \langle DU^T x, U^T x \rangle = \langle De_j, e_j \rangle = \delta_j. \]

Because of $\delta_j > 0$ we may define $D^{1/2}$ as the diagonal matrix with entries $\delta_j^{1/2}$ on its
diagonal. Setting $A := UD^{1/2}$, because of $(D^{1/2})^T = D^{1/2}$ it follows that

\[ R = (UD^{1/2})(UD^{1/2})^T = AA^T. \]

Since $|\det(A)|^2 = \det(A)\det(A^T) = \det(R) > 0$, the matrix $A$ is regular, and this completes
the proof. □


Remark A.4.3. Note that representation (A.19) is not unique. Indeed, whenever
$R = AA^T$, then we also have $R = (AV)(AV)^T$ for any unitary matrix $V$.
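The non-uniqueness of the representation $R = AA^T$ can be made concrete. The Python sketch below starts with a regular matrix $A$, forms $R = AA^T$, and then computes a second factor by the Cholesky factorization. Note that this is a different construction from the principal axis transformation used in the proof of Proposition A.4.2: it produces a lower-triangular matrix $L$ with $R = LL^T$. The code uses plain lists and our own helper names; it is an illustration, not a numerically robust implementation.

```python
import math


def transpose(M):
    return [list(col) for col in zip(*M)]


def mat_mul(A, B):
    Bt = transpose(B)
    return [[sum(a * b for a, b in zip(row, col)) for col in Bt] for row in A]


def cholesky(R):
    """Lower-triangular L with R = L L^T; raises ValueError if R is not positive."""
    n = len(R)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = R[i][j] - sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                if s <= 0:
                    raise ValueError("matrix is not positive definite")
                L[i][j] = math.sqrt(s)
            else:
                L[i][j] = s / L[j][j]
    return L


A = [[2.0, 1.0], [0.0, 3.0]]    # regular, since det(A) = 6
R = mat_mul(A, transpose(A))    # R = A A^T
L = cholesky(R)                 # a second factor with R = L L^T
print(R)
print(L)
print(mat_mul(L, transpose(L)))
```

Here $A$ is upper triangular while $L$ is lower triangular, so the two factors differ although both reproduce the same $R$.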


A.5 Some Analytic Tools 


The aim of this section is to present some special results of Calculus that play an important
role in the book. Hereby we restrict ourselves to those topics that are maybe
less known and that are not necessarily taught in a basic Calculus course. For a general
introduction to Calculus, including topics such as convergence of power series, the
fundamental theorem of Calculus, the mean-value theorem, and so on, we refer to the
books [Spi08] and [Ste15].
We start with a result that is used in the proof of Poisson's limit theorem 1.4.22.
From Calculus it is well known that for $x \in \mathbb{R}$

\[ \lim_{n \to \infty} \Big( 1 + \frac{x}{n} \Big)^n = e^x. \tag{A.21} \]

Probably the easiest proof of this fact is via the approach presented in [Spi08]. There
the logarithm function is defined by $\ln t = \int_1^t \frac{1}{s} \, ds$, $t > 0$. Hence, l'Hôpital's rule implies

\[ \lim_{t \to \infty} t \ln \Big( 1 + \frac{x}{t} \Big) = x, \qquad x \in \mathbb{R}. \]

From this eq. (A.21) easily follows by the continuity of the exponential function.
The next proposition may be viewed as a slight generalization of eq. (A.21). 


Proposition A.5.1. Let $(x_n)_{n \ge 1}$ be a sequence of real numbers with $\lim_{n \to \infty} x_n = x$ for
some $x \in \mathbb{R}$. Then we get

\[ \lim_{n \to \infty} \Big( 1 + \frac{x_n}{n} \Big)^n = e^x. \]

Proof: Because of eq. (A.21) it suffices to verify that

\[ \lim_{n \to \infty} \Big| \Big( 1 + \frac{x_n}{n} \Big)^n - \Big( 1 + \frac{x}{n} \Big)^n \Big| = 0. \tag{A.22} \]

Since the sequence $(x_n)_{n \ge 1}$ is converging, it is bounded. Consequently, there is a $c > 0$
such that for all $n \ge 1$, we have $|x_n| \le c$. Of course, we may also assume $|x| \le c$. Fix for
a moment $n \ge 1$ and set

\[ a := 1 + \frac{x_n}{n} \quad \text{and} \quad b := 1 + \frac{x}{n}. \]

The choice of $c > 0$ yields $|a| \le 1 + c/n$ as well as $|b| \le 1 + c/n$. Hence it follows

\[ |a^n - b^n| = |a - b| \, |a^{n-1} + a^{n-2} b + \cdots + a b^{n-2} + b^{n-1}| \]

\[ \le |a - b| \, \big( |a|^{n-1} + |a|^{n-2} |b| + \cdots + |a| |b|^{n-2} + |b|^{n-1} \big) \]

\[ \le |a - b| \, n \Big( 1 + \frac{c}{n} \Big)^{n-1} \le C \, n \, |a - b|. \]

Here $C > 0$ is some constant that exists since $(1 + c/n)^{n-1}$ converges to $e^c$. By the
definition of $a$ and $b$,

\[ |a^n - b^n| \le C \, n \, \frac{|x_n - x|}{n} = C \, |x - x_n|. \]

Since $x_n \to x$, this immediately implies eq. (A.22) and proves the proposition. □
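A quick numerical experiment illustrates Proposition A.5.1. In the Python sketch below the sequence $x_n = x + 1/\sqrt{n}$ is a hypothetical choice made only for this illustration; any sequence converging to $x$ would do. The function name `perturbed_power` is also our own.

```python
import math


def perturbed_power(x_seq, n: int) -> float:
    """Return (1 + x_n / n) ** n, where x_seq(n) gives the n-th member x_n."""
    return (1.0 + x_seq(n) / n) ** n


x = 2.0


def x_seq(n: int) -> float:
    """Hypothetical sequence with x_n -> x as n -> infinity."""
    return x + 1.0 / math.sqrt(n)


for n in (10, 10**3, 10**5, 10**7):
    print(n, perturbed_power(x_seq, n), math.exp(x))
```

The convergence is slow here because $|x_n - x| = n^{-1/2}$ decays slowly, in line with the estimate $C\,|x - x_n|$ obtained in the proof.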


Our next objective is to present some properties of power series and of the functions
generated by them. Hereby we restrict ourselves to such assertions that we will use in
this book. For further reading we refer to Part IV in [Spi08].

Let $(a_k)_{k \ge 0}$ be a sequence of real numbers. Then its radius of convergence
$r \in [0, \infty]$ is defined by

\[ r := \frac{1}{\limsup_{k \to \infty} |a_k|^{1/k}}. \]
Hereby we let $1/0 := \infty$ and $1/\infty := 0$. If $0 < r \le \infty$ and $|x| < r$, then the infinite series

\[ f(x) := \sum_{k=0}^{\infty} a_k x^k \tag{A.23} \]

converges (even absolutely). Hence the function $f$ generated by eq. (A.23) is well-defined
on its region of convergence $\{x \in \mathbb{R} : |x| < r\}$. We say that $f$ is represented as
a power series on $\{x \in \mathbb{R} : |x| < r\}$.
The function $f$ defined by eq. (A.23) is infinitely often differentiable on its region
of convergence and

\[ f^{(n)}(x) = \sum_{k=n}^{\infty} k (k-1) \cdots (k-n+1) \, a_k x^{k-n} \]

\[ = \sum_{k=0}^{\infty} (k+n)(k+n-1) \cdots (k+1) \, a_{k+n} x^k = n! \sum_{k=0}^{\infty} \binom{n+k}{k} a_{k+n} x^k \tag{A.24} \]

(compare [Spi08], §27, Thm. 6).

The coefficients $n! \binom{n+k}{k} a_{n+k}$ in the series representation of the $n$th derivative $f^{(n)}$
possess the same radius of convergence as the original sequence $(a_k)_{k \ge 0}$. This is easy
to see for $n = 1$. The general case then follows by induction.

Furthermore, eq. (A.24) implies $a_n = f^{(n)}(0)/n!$, which, in particular, tells us that
given $f$, the coefficients $(a_k)_{k \ge 0}$ in representation (A.23) are unique.
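Formula (A.24) can be tested on the geometric series $f(x) = 1/(1-x) = \sum_{k \ge 0} x^k$, for which $a_k = 1$, $r = 1$ and $f^{(n)}(x) = n!/(1-x)^{n+1}$. The Python sketch below evaluates a partial sum of the last expression in (A.24); the function name `nth_derivative_series` and the truncation after 400 terms are our own choices.

```python
from math import comb, factorial


def nth_derivative_series(coeffs, n: int, x: float, terms: int = 400) -> float:
    """Partial sum of n! * sum_k C(n+k, k) * a_{k+n} * x**k, with a_k = coeffs(k)."""
    return factorial(n) * sum(
        comb(n + k, k) * coeffs(k + n) * x ** k for k in range(terms)
    )


def geometric(k: int) -> float:
    """Coefficients a_k = 1 of the geometric series 1 / (1 - x)."""
    return 1.0


n, x = 3, 0.5
print(nth_derivative_series(geometric, n, x), factorial(n) / (1.0 - x) ** (n + 1))

# At x = 0 only the term k = 0 survives, which recovers a_n = f^(n)(0) / n!.
print(nth_derivative_series(geometric, n, 0.0) / factorial(n))
```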


Proposition A.5.2. If $n \ge 1$ and $|x| < 1$ then it follows

\[ \frac{1}{(1+x)^n} = \sum_{k=0}^{\infty} \binom{-n}{k} x^k. \]

Proof: Using the formula to add a geometric series and applying $\binom{-1}{k} = (-1)^k$ yields
for $|x| < 1$ that

\[ \frac{1}{1+x} = \sum_{k=0}^{\infty} (-1)^k x^k = \sum_{k=0}^{\infty} \binom{-1}{k} x^k. \]

Consequently Proposition A.5.2 holds for $n = 1$.
Assume now we have proven the proposition for $n - 1$, that is, if $|x| < 1$, then

\[ \frac{1}{(1+x)^{n-1}} = \sum_{k=0}^{\infty} \binom{-n+1}{k} x^k. \]

Differentiating this equality on the region $\{x : |x| < 1\}$ implies

\[ -\frac{n-1}{(1+x)^n} = \sum_{k=1}^{\infty} \binom{-n+1}{k} k \, x^{k-1} = \sum_{k=0}^{\infty} \binom{-n+1}{k+1} (k+1) \, x^k. \tag{A.26} \]

Direct calculations give

\[ -\frac{k+1}{n-1} \binom{-n+1}{k+1} = -\frac{k+1}{n-1} \cdot \frac{(-n+1)(-n) \cdots (-n+1-(k+1)+1)}{(k+1)!} = \frac{(-n)(-n-1) \cdots (-n-k+1)}{k!} = \binom{-n}{k}, \]

which together with eq. (A.26) leads to

\[ \frac{1}{(1+x)^n} = \sum_{k=0}^{\infty} \binom{-n}{k} x^k. \]

This completes the proof of Proposition A.5.2. □
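The generalized binomial coefficient $\binom{-n}{k} = (-n)(-n-1)\cdots(-n-k+1)/k!$ is easy to compute exactly. The following Python sketch (with our own names `gen_binom` and `neg_binomial_series`) evaluates it with rational arithmetic and compares a partial sum of the series in Proposition A.5.2 with $1/(1+x)^n$. The truncation after 200 terms is an arbitrary choice that is more than sufficient for $|x| = 1/2$.

```python
from fractions import Fraction


def gen_binom(a: int, k: int) -> Fraction:
    """Generalized binomial coefficient a (a-1) ... (a-k+1) / k! for integer a, k >= 0."""
    result = Fraction(1)
    for i in range(k):
        result *= Fraction(a - i, i + 1)
    return result


def neg_binomial_series(n: int, x: float, terms: int = 200) -> float:
    """Partial sum of sum_k C(-n, k) x**k, which approximates 1 / (1 + x)**n for |x| < 1."""
    return sum(float(gen_binom(-n, k)) * x ** k for k in range(terms))


print([int(gen_binom(-1, k)) for k in range(6)])   # alternating signs (-1)**k
print(neg_binomial_series(3, 0.5), 1.0 / (1.0 + 0.5) ** 3)
```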


The next proposition may be viewed as a counterpart to eq. (A.10) in the case of 
generalized binomial coefficients. 


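Proposition A.5.3. For $k \ge 0$ and $m, n \in \mathbb{N}$,

\[ \sum_{j=0}^{k} \binom{-n}{j} \binom{-m}{k-j} = \binom{-n-m}{k}. \]

Proof: The proof is similar to that of Proposition A.3.8. Using Proposition A.5.2 we
represent the function $(1+x)^{-n-m}$ as a power series in two different ways. On the one
hand for $|x| < 1$ we have the representation

\[ \frac{1}{(1+x)^{n+m}} = \sum_{k=0}^{\infty} \binom{-n-m}{k} x^k \tag{A.27} \]

and on the other hand

\[ \frac{1}{(1+x)^{n+m}} = \Big[ \sum_{j=0}^{\infty} \binom{-n}{j} x^j \Big] \Big[ \sum_{l=0}^{\infty} \binom{-m}{l} x^l \Big] = \sum_{k=0}^{\infty} \Big[ \sum_{j+l=k} \binom{-n}{j} \binom{-m}{l} \Big] x^k = \sum_{k=0}^{\infty} \Big[ \sum_{j=0}^{k} \binom{-n}{j} \binom{-m}{k-j} \Big] x^k. \tag{A.28} \]

As observed above the coefficients in a power series are uniquely determined. Thus,
the coefficients in eqs. (A.27) and (A.28) have to coincide, which implies

\[ \sum_{j=0}^{k} \binom{-n}{j} \binom{-m}{k-j} = \binom{-n-m}{k} \]

as asserted. □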
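The identity of Proposition A.5.3 involves only rational numbers, so it can be verified exactly for a range of small parameters. The Python sketch below does so; `gen_binom` and `convolution` are our own helper names, and the tested parameter ranges are arbitrary.

```python
from fractions import Fraction


def gen_binom(a: int, k: int) -> Fraction:
    """Generalized binomial coefficient a (a-1) ... (a-k+1) / k! for integer a, k >= 0."""
    result = Fraction(1)
    for i in range(k):
        result *= Fraction(a - i, i + 1)
    return result


def convolution(n: int, m: int, k: int) -> Fraction:
    """Left-hand side of Proposition A.5.3."""
    return sum(gen_binom(-n, j) * gen_binom(-m, k - j) for j in range(k + 1))


for n, m, k in [(1, 1, 3), (2, 3, 4), (4, 2, 5)]:
    print(n, m, k, convolution(n, m, k), gen_binom(-n - m, k))
```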


Let $f : \mathbb{R}^n \to \mathbb{R}$ be a function. How does one define the integral $\int_{\mathbb{R}^n} f(x) \, dx$? To
simplify the notation let us restrict ourselves to the case $n = 2$. The main problems
already become clear in this case and the obtained results easily extend to higher
dimensions.
The easiest way to introduce the integral of a function of two variables is as
follows:

\[ \int_{\mathbb{R}^2} f(x) \, dx := \int_{-\infty}^{\infty} \Big[ \int_{-\infty}^{\infty} f(x_1, x_2) \, dx_2 \Big] dx_1. \]

In order for this double integral to be well-defined we have to assume the existence of
the inner integral for each fixed $x_1 \in \mathbb{R}$ and then the existence of the integral of the
function

\[ x_1 \mapsto \int_{-\infty}^{\infty} f(x_1, x_2) \, dx_2. \]

Doing so the following question arises immediately: why do we not define the integral
in reversed order, that is, first integrating with respect to $x_1$ and then with respect to $x_2$?
To see the difficulties that may appear let us consider the following example.


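Example A.5.4. The function $f : \mathbb{R}^2 \to \mathbb{R}$ is defined as follows (Fig. A.1): If either $x_1 < 0$
or $x_2 < 0$ set $f(x_1, x_2) = 0$. If $x_1, x_2 \ge 0$ define $f$ by

\[ f(x_1, x_2) := \begin{cases} +1 & : \ x_1 \le x_2 < x_1 + 1 \\ -1 & : \ x_1 + 1 \le x_2 < x_1 + 2 \\ 0 & : \ \text{otherwise.} \end{cases} \]

We immediately see that

\[ \int_0^{\infty} f(x_1, x_2) \, dx_2 = 0 \ \text{ for all } x_1 \in \mathbb{R}, \quad \text{hence} \quad \int_0^{\infty} \Big[ \int_0^{\infty} f(x_1, x_2) \, dx_2 \Big] dx_1 = 0. \]

Figure A.1: The function f.

On the other side it follows

\[ \int_0^{\infty} f(x_1, x_2) \, dx_1 = \begin{cases} \int_0^{x_2} (+1) \, dx_1 = x_2 & : \ 0 \le x_2 < 1 \\ \int_0^{x_2 - 1} (-1) \, dx_1 + \int_{x_2 - 1}^{x_2} (+1) \, dx_1 = 2 - x_2 & : \ 1 \le x_2 \le 2 \\ \int_{x_2 - 2}^{x_2} f(x_1, x_2) \, dx_1 = 0 & : \ 2 < x_2 < \infty \end{cases} \]

leading to

\[ \int_0^{\infty} \Big[ \int_0^{\infty} f(x_1, x_2) \, dx_1 \Big] dx_2 = 1 \ne 0 = \int_0^{\infty} \Big[ \int_0^{\infty} f(x_1, x_2) \, dx_2 \Big] dx_1. \]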
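The two iterated integrals of Example A.5.4 can also be evaluated numerically. In the Python sketch below the inner integrals are computed exactly as differences of interval lengths, and the outer integrals are approximated by a midpoint rule on $[0, 10]$. The cut-off 10 is harmless here because both inner integrals vanish for arguments larger than 2. All function names are our own.

```python
def overlap(lo: float, hi: float, a: float, b: float) -> float:
    """Length of the intersection of the intervals [lo, hi] and [a, b]."""
    return max(0.0, min(hi, b) - max(lo, a))


INF = float("inf")


def inner_dx2(x1: float) -> float:
    """Integral of f(x1, x2) over x2 >= 0 for fixed x1 >= 0."""
    plus = overlap(x1, x1 + 1.0, 0.0, INF)         # f = +1 for x1 <= x2 < x1 + 1
    minus = overlap(x1 + 1.0, x1 + 2.0, 0.0, INF)  # f = -1 for x1 + 1 <= x2 < x1 + 2
    return plus - minus


def inner_dx1(x2: float) -> float:
    """Integral of f(x1, x2) over x1 >= 0 for fixed x2 >= 0."""
    plus = overlap(x2 - 1.0, x2, 0.0, INF)         # f = +1 for x2 - 1 < x1 <= x2
    minus = overlap(x2 - 2.0, x2 - 1.0, 0.0, INF)  # f = -1 for x2 - 2 < x1 <= x2 - 1
    return plus - minus


def midpoint_rule(g, a: float, b: float, steps: int) -> float:
    h = (b - a) / steps
    return h * sum(g(a + (i + 0.5) * h) for i in range(steps))


print(midpoint_rule(inner_dx2, 0.0, 10.0, 1000))  # first dx2, then dx1
print(midpoint_rule(inner_dx1, 0.0, 10.0, 1000))  # first dx1, then dx2
```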


Example A.5.4 shows that neither the definition of the integral of functions of several
variables nor the interchange of integrals is unproblematic. Fortunately, we have the
following positive result (see [Dur10], Section 1.7, for more information).


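Proposition A.5.5 (Fubini's theorem). If $f(x_1, x_2) \ge 0$ for all $(x_1, x_2) \in \mathbb{R}^2$, then one may
interchange the order of integration. In other words,

\[ \int_{-\infty}^{\infty} \Big[ \int_{-\infty}^{\infty} f(x_1, x_2) \, dx_2 \Big] dx_1 = \int_{-\infty}^{\infty} \Big[ \int_{-\infty}^{\infty} f(x_1, x_2) \, dx_1 \Big] dx_2. \tag{A.29} \]

Hereby we do not exclude that one of the two, hence also the other, iterated integral is
infinite.

Furthermore, in the general case ($f$ may attain also negative values) equality (A.29)
holds provided that one of the iterated integrals, for example,

\[ \int_{-\infty}^{\infty} \Big[ \int_{-\infty}^{\infty} |f(x_1, x_2)| \, dx_1 \Big] dx_2 \]

is finite. Due to the first part then we also have

\[ \int_{-\infty}^{\infty} \Big[ \int_{-\infty}^{\infty} |f(x_1, x_2)| \, dx_2 \Big] dx_1 < \infty. \]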


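Whenever a function $f$ on $\mathbb{R}^2$ satisfies one of the two assumptions in Proposition A.5.5,
then by

\[ \int_{\mathbb{R}^2} f(x) \, dx := \int_{-\infty}^{\infty} \Big[ \int_{-\infty}^{\infty} f(x_1, x_2) \, dx_1 \Big] dx_2 = \int_{-\infty}^{\infty} \Big[ \int_{-\infty}^{\infty} f(x_1, x_2) \, dx_2 \Big] dx_1 \]

the integral of $f$ is well-defined. Given a subset $B \subseteq \mathbb{R}^2$ we set

\[ \int_B f(x) \, dx := \int_{\mathbb{R}^2} f(x) \, 1_B(x) \, dx, \]

provided the integral exists. Recall that $1_B$ denotes the indicator function of $B$
introduced in eq. (3.20).

For example, let $K_1$ be the unit circle in $\mathbb{R}^2$, that is, $K_1 = \{(x_1, x_2) : x_1^2 + x_2^2 \le 1\}$, then
it follows

\[ \int_{K_1} f(x) \, dx = \int_{-1}^{1} \Big[ \int_{-\sqrt{1 - x_1^2}}^{\sqrt{1 - x_1^2}} f(x_1, x_2) \, dx_2 \Big] dx_1. \]

Or, if $B = \{(x_1, x_2, x_3) \in \mathbb{R}^3 : x_1 \le x_2 \le x_3\}$, we have

\[ \int_B f(x) \, dx = \int_{-\infty}^{\infty} \Big[ \int_{-\infty}^{x_3} \Big[ \int_{-\infty}^{x_2} f(x_1, x_2, x_3) \, dx_1 \Big] dx_2 \Big] dx_3. \]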


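Remark A.5.6. Proposition A.5.5 is also valid for infinite double series. Let $a_{ij}$ be real
numbers either satisfying $a_{ij} \ge 0$ or $\sum_{i=0}^{\infty} \sum_{j=0}^{\infty} |a_{ij}| < \infty$, then this implies

\[ \sum_{i=0}^{\infty} \sum_{j=0}^{\infty} a_{ij} = \sum_{j=0}^{\infty} \sum_{i=0}^{\infty} a_{ij} = \sum_{i,j=0}^{\infty} a_{ij}. \]

Even more generally, if the sets $I_k \subseteq \mathbb{N}_0^2$, $k \in \mathbb{N}_0$, form a disjoint partition of $\mathbb{N}_0^2$, then

\[ \sum_{i,j=0}^{\infty} a_{ij} = \sum_{k=0}^{\infty} \sum_{(i,j) \in I_k} a_{ij}. \]

For example, if $I_k = \{(i, j) \in \mathbb{N}_0^2 : i + j = k\}$, then

\[ \sum_{i,j=0}^{\infty} a_{ij} = \sum_{k=0}^{\infty} \sum_{(i,j) \in I_k} a_{ij} = \sum_{k=0}^{\infty} \sum_{i=0}^{k} a_{i,k-i}. \]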
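As an illustration of the rearrangement along the diagonals $i + j = k$, consider the hypothetical example $a_{ij} = \frac{x^i}{i!} \frac{y^j}{j!}$, which is absolutely summable. Summing by rows gives $e^x e^y$, while summing along diagonals gives $\sum_k (x+y)^k / k! = e^{x+y}$. The Python sketch below compares truncated versions of the three orders of summation; the truncation level `N = 30` and the values of `x` and `y` are arbitrary choices.

```python
import math


def a(i: int, j: int, x: float = 0.3, y: float = 0.5) -> float:
    """Hypothetical absolutely summable double sequence a_ij = x^i / i! * y^j / j!."""
    return x ** i / math.factorial(i) * y ** j / math.factorial(j)


def row_wise(N: int) -> float:
    return sum(sum(a(i, j) for j in range(N)) for i in range(N))


def column_wise(N: int) -> float:
    return sum(sum(a(i, j) for i in range(N)) for j in range(N))


def diagonal_wise(N: int) -> float:
    """Sum over the sets I_k = {(i, j) : i + j = k} for k < N."""
    return sum(sum(a(i, k - i) for i in range(k + 1)) for k in range(N))


N = 30
print(row_wise(N), column_wise(N), diagonal_wise(N), math.exp(0.3 + 0.5))
```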
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